When a user starts a new kernel, several options determine how long Jupyter waits for it to be spawned before treating the operation as failed.
A common reason for a timeout is that no resources are available to meet the kernel requirements. For example, a user may try to schedule a Slurm kernel with one GPU while all the nodes with GPUs are busy. As a result, the kernel fails to start and Jupyter displays a popup error.
Raising timeouts to start Jupyter kernels could help with this issue. However, Bright does not generally recommend doing it, because large timeouts may hamper the user experience.
Since Jupyter is designed to be an interactive environment, it usually makes sense to inform users as soon as possible when they cannot start kernels and run notebooks. It is undesirable to make a user wait for a long time before discovering the kernel can’t be spawned (and then perhaps let the user start a different one).
Bright considers the default Jupyter timeouts to be suitable for most use cases. Nevertheless, in some scenarios, it may be necessary to increase them. An example of such a scenario is when Bright’s cm-scale utility is used to provision nodes on the fly to scale the cluster up. Spinning up new nodes (physical, cloud, or virtual) may require a few minutes. In this case, it makes sense for Jupyter users to wait longer for the kernel to be ready.
Configuring kernel startup timeouts involves customizing both JupyterHub options and Jupyter Enterprise Gateway options.
An administrator first has to increase how long JupyterHub is willing to wait for a successful reply from its kernel provisioner (on Bright clusters, this is Jupyter Enterprise Gateway). Then, Jupyter Enterprise Gateway itself has to be granted more time to actually start the kernel process on a node with appropriate resources (e.g. via a workload manager job or a Kubernetes pod).
The most straightforward way to configure kernel startup timeouts is by defining two environment variables:
KERNEL_LAUNCH_TIMEOUT for JupyterHub
EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT for Jupyter Enterprise Gateway
These variables have to be defined in a new file named /etc/default/jupyterhub-singleuser-gw.
An example for a 120-second timeout is provided below:
# cat /etc/default/jupyterhub-singleuser-gw
export KERNEL_LAUNCH_TIMEOUT=120
export EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT=120
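Instead of editing the file by hand, it could be generated with a small shell helper. The following is only a sketch: the function name write_kernel_timeouts and its two arguments are hypothetical and are not part of Bright or Jupyter. It assumes, as in the example, that the same value in seconds is used for both variables.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Bright or Jupyter):
# write both timeout variables to a defaults file.
#   $1 = target file
#   $2 = timeout in seconds, used for both variables (assumption)
write_kernel_timeouts() {
    target="$1"
    timeout="$2"

    # Accept digits only, so that the generated file stays valid shell.
    case "$timeout" in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac

    # Overwrite the target with the two export lines.
    printf 'export KERNEL_LAUNCH_TIMEOUT=%s\nexport EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT=%s\n' \
        "$timeout" "$timeout" > "$target"
}
```

Run as root, a call such as write_kernel_timeouts /etc/default/jupyterhub-singleuser-gw 120 would produce the same content as the example. Note that the helper overwrites the target file, so it is not suitable if the file already holds other settings.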
Please note this configuration file has to be created on every Jupyter login node.
By default, the Jupyter login node is the cluster’s head node, but configurations may vary and multiple login nodes may exist.
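Because the file has to exist on every login node, a quick check on each node may help catch a node that was missed. The sketch below is an illustration only: the function name check_kernel_timeouts is hypothetical, and it assumes the file consists of plain shell export lines and can therefore be sourced.

```shell
#!/bin/sh
# Hypothetical check (not shipped with Bright or Jupyter):
# report whether a defaults file sets both timeout variables.
#   $1 = file to inspect, e.g. /etc/default/jupyterhub-singleuser-gw
check_kernel_timeouts() {
    file="$1"
    [ -r "$file" ] || { echo "missing or unreadable: $file" >&2; return 1; }

    # Source the file in a subshell so the caller's environment is untouched.
    (
        unset KERNEL_LAUNCH_TIMEOUT EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT
        . "$file"
        # Fail if either variable is still empty after sourcing.
        [ -n "$KERNEL_LAUNCH_TIMEOUT" ] && [ -n "$EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT" ] || exit 1
        echo "KERNEL_LAUNCH_TIMEOUT=$KERNEL_LAUNCH_TIMEOUT"
        echo "EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT=$EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT"
    )
}
```

The function prints both values and returns zero when both variables are set, and returns non-zero when the file is missing or defines only one of them. It only inspects the file and does not confirm that a running Jupyter service has picked up the values.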