
Raising timeouts to start Jupyter kernels


When a user starts a new kernel, several configuration options determine how long Jupyter waits for it to be spawned successfully before considering the operation failed.

A common reason for hitting these timeouts is that no resources are available to meet the kernel requirements. For example, a user may try to schedule a Slurm kernel with one GPU while all nodes with GPUs are busy. In that case, Jupyter fails to start the kernel and displays a popup error.
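For instance, on a cluster where Slurm manages the GPU nodes, one way to see whether any GPU nodes are currently free is to query Slurm directly. The command below is only a generic sketch; it assumes GPUs are registered as GRES in Slurm, and the output columns (node name, generic resources, state) will reflect the site's own node naming:

$ sinfo -N -o "%n %G %t"    # list each node with its generic resources (GPUs) and state

Nodes with GPUs that are reported as allocated or drained cannot serve a new GPU kernel, so such a kernel request would remain pending until Slurm can schedule it.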

Raising the timeouts to start Jupyter kernels can help in this situation. However, Bright generally does not recommend it, because large timeouts can degrade the user experience.


Since Jupyter is designed to be an interactive environment, it usually makes sense to inform users as soon as possible when they cannot start kernels and run notebooks. It is undesirable to make a user wait for a long time before discovering that the kernel cannot be spawned (and only then, perhaps, try a different one).

Bright considers the default Jupyter timeouts to be suitable for most use cases. Nevertheless, in some scenarios it may be necessary to increase them. An example of such a scenario is when Bright's cm-scale utility is used to provision nodes on the fly and scale the cluster up. Spinning up new nodes (physical, cloud, or virtual) may take a few minutes; in this case, it makes sense to let Jupyter users wait longer for the kernel to be ready.

Configuration

Configuring kernel startup timeouts involves customizing both JupyterHub options and Jupyter Enterprise Gateway options.
An administrator first has to increase how long JupyterHub is willing to wait for a successful reply from its kernel provisioner (Jupyter Enterprise Gateway on Bright clusters). Then, Jupyter Enterprise Gateway itself has to be granted more time to actually start the kernel process on a node with the appropriate resources (e.g. via a workload manager job or a Kubernetes pod).

The most straightforward way to configure kernel startup timeouts is by defining two environment variables:

  • KERNEL_LAUNCH_TIMEOUT for JupyterHub
  • EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT for Jupyter Enterprise Gateway

These variables have to be defined in a new file named jupyterhub-singleuser-gw, under /etc/default/.

An example for a 120-second timeout is provided below:

# cat /etc/default/jupyterhub-singleuser-gw
export KERNEL_LAUNCH_TIMEOUT=120
export EG_JUPYTER_GATEWAY_CONNECT_TIMEOUT=120

Please note this configuration file has to be created on every Jupyter login node.
By default, the Jupyter login node is the cluster’s head node, but configurations may vary and multiple login nodes may exist.
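For clusters with more than one login node, the file can simply be copied to each of them. The loop below is a minimal sketch assuming two hypothetical login nodes named login01 and login02 and SSH access from the head node; the actual node names and distribution method (software image, pdsh, configuration management, etc.) will vary per site:

for node in login01 login02; do    # hypothetical login node names
    scp /etc/default/jupyterhub-singleuser-gw ${node}:/etc/default/
done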

Please also note that cm-jupyter version 12.2.0 or newer is required to set a timeout longer than 2 minutes.
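To check which cm-jupyter version is installed, the package manager can be queried. The commands below are a sketch; which one applies depends on whether the node runs an RPM-based or a Debian/Ubuntu-based distribution:

# rpm -q cm-jupyter     # RPM-based distributions
# dpkg -l cm-jupyter    # Debian/Ubuntu-based distributions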

Updated on October 26, 2021


Comments

  1. Does the cm-jupyterhub service need to be restarted after the /etc/default/jupyterhub-singleuser-gw is created, or will these configs be picked up automatically?
