Sometimes when compute nodes stay up for long periods of time, /dev/shm gets filled with stale files. This can happen if MPI jobs abort in an unexpected way. The stale files get cleaned up if the node is rebooted. A cleanup that avoids a reboot is also possible, simply by remounting /dev/shm, but this may affect MPI jobs using /dev/shm at that time.
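On a tmpfs, a plain remount does not discard the files it holds, so remounting here means unmounting /dev/shm and mounting it afresh. A minimal sketch, run as root on the node, assuming no process currently holds files open in /dev/shm (the umount otherwise fails with "target is busy"):
# umount /dev/shm && mount -t tmpfs tmpfs /dev/shm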
A gentler way to deal with this is to have a script clean /dev/shm if needed. It can be run each time a job attempts to start, by adding it as a custom prejob health check.
The following script deletes files under /dev/shm that do not belong to users running jobs on the node:
#!/bin/bash
SHMDIR=/dev/shm

# Never remove stale files owned by root.
ignoretoken="-not -user root"

# Find the users active on the node via ps, as w/who do not work without
# a login session; UIDs from 1000 up are taken to be regular users.
for user in $(ps -eo euid,euser --sort +euid --no-headers | awk '{if ($1 >= 1000) print $2;}' | uniq)
do
    ignoretoken="${ignoretoken} -not -user $user"
done

# Delete everything under /dev/shm not owned by root or an active user.
find $SHMDIR -mindepth 1 $ignoretoken -delete

# CMDaemon reads the health check verdict from stdout.
echo PASS
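Before wiring the script into monitoring, replacing the -delete action with -print gives a harmless dry run that only lists what would be removed:
find $SHMDIR -mindepth 1 $ignoretoken -print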
The script is saved to a path that exists on the regular nodes and made executable. The following steps then add it as a prejob health check via cmsh:
# cmsh
% monitoring healthchecks
% add clear_shm
% set command /path/to/custom/script
% commit
% setup healthconf <category name>
% add clear_shm
% set checkinterval prejob
% commit
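The result can be verified from cmsh's one-shot mode; this sketch assumes the clear_shm name and the category chosen above:
# cmsh -c "monitoring healthchecks; show clear_shm"
# cmsh -c "monitoring setup healthconf <category name>; list"
The first command prints the health check object with its command path; the second lists the checks configured for the category, where clear_shm should appear with its check interval set to prejob.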