The instructions in this article can be followed to enable Kdump on Ubuntu 20.04 compute nodes.
As a precaution, consider trying this procedure on a test compute node first. One approach is to clone the production software image used by the nodes, apply the changes to the cloned image, and then set a designated test node to use that cloned image. The cmsh commands to clone an existing software image and then set a node to use the clone are as follows:
% softwareimage
% clone <existing_image_name> <cloned_image_name>
% commit
(wait for the image copy and ramdisk creation to complete)
% device use <test_node>
% set softwareimage <cloned_image_name>
% commit
(wait for the ramdisk creation to complete)
Install the linux-crashdump package in the software image used by the Ubuntu node(s) or category. During installation, the package prompts with the questions “Should kexec-tools handle reboots (sysvinit only)?” and “Should kdump-tools be enabled by default?”. Answer ‘yes’ to both.
# cm-chroot-sw-img /cm/images/<image_name>/
# apt update
# apt install linux-crashdump
After that, add the following line to the /etc/default/kdump-tools file in the software image:
KDUMP_CMDLINE=`echo "root=$(findmnt -fno SOURCE /) $(cat /proc/cmdline)"`
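If the line is added by a script rather than by hand, it helps to avoid appending it twice. The following is a minimal sketch of one way to do that. The function name add_kdump_cmdline is hypothetical and not part of any package, and the sketch assumes that an uncommented KDUMP_CMDLINE= line is the only thing that needs to be detected.

```shell
#!/bin/sh
# Hypothetical helper (not part of kdump-tools or the cluster manager):
# append the KDUMP_CMDLINE line to a kdump-tools defaults file, but only
# if the file does not already contain an uncommented KDUMP_CMDLINE= line.
add_kdump_cmdline() {
    conf_file="$1"
    # Single quotes keep the backticks and $(...) literal, so they are
    # written to the file as-is instead of being expanded by this script.
    line='KDUMP_CMDLINE=`echo "root=$(findmnt -fno SOURCE /) $(cat /proc/cmdline)"`'
    if grep -q '^KDUMP_CMDLINE=' "$conf_file" 2>/dev/null; then
        echo "KDUMP_CMDLINE already set in $conf_file; leaving it unchanged"
    else
        printf '%s\n' "$line" >> "$conf_file"
        echo "Appended KDUMP_CMDLINE to $conf_file"
    fi
}

# Example call (the image name is a placeholder, as elsewhere in this article):
# add_kdump_cmdline /cm/images/<image_name>/etc/default/kdump-tools
```

Running the function a second time against the same file leaves the file unchanged, so it is safe to re-run.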
Update software image kernel parameters
Update the kernel parameters of the software image from cmsh. In this example the crashkernel parameter is set to 512M. The value may need to be adjusted according to the memory size of the node.
% softwareimage use <image_name>
% append kernelparameters " crashkernel=512M"
% commit
Add /var/crash files to the category exclude list
Run the cmsh commands below to add /var/crash/* to the category exclude list for SYNC install mode, so that the /var/crash/* files are not wiped out during SYNC install.
% category use <category_name>
% set excludelistsyncinstall
(Add the following 2 lines, then save and exit)
- /var/crash/*
no-new-files: - /var/crash/*
% commit
Reboot and test Kdump configuration
To apply the Kdump configuration, the node needs to be rebooted. Once the node is back up, run the ‘kdump-config status’ command on the compute node to check whether Kdump is ready. Example command output is given below for reference:
# kdump-config status
current state : ready to kdump
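When several nodes need to be checked, the status check can be scripted. The sketch below is an illustration only: the function names kdump_is_ready and crashkernel_value are hypothetical, and the pattern assumes that the status line has the form "current state", a colon, then "ready to kdump". A plain search for "ready to kdump" is avoided on purpose, because it would also match a line that says the node is not ready.

```shell
#!/bin/sh
# Hypothetical helpers for checking a node after reboot.

# Succeeds only if the status text on stdin contains a "current state" line
# whose value starts with "ready to kdump". Spacing around the colon is
# matched loosely, since it can differ between kdump-tools versions.
kdump_is_ready() {
    grep -Eq '^current state[[:space:]]*:[[:space:]]*ready to kdump'
}

# Prints the value of the first crashkernel= parameter found in the kernel
# command line string given as $1, or nothing if the parameter is absent.
crashkernel_value() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^crashkernel=//p' | head -n 1
}

# Usage on the compute node:
# kdump-config status | kdump_is_ready && echo "Kdump is ready"
# crashkernel_value "$(cat /proc/cmdline)"
```

If crashkernel_value prints nothing on the rebooted node, the kernel parameter set in the software image did not reach the running kernel, which is worth checking before looking at Kdump itself.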
If Kdump is ready, its functionality can be tested by deliberately crashing the node OS through sysrq-trigger. To enable sysrq on the compute node, set the kernel parameter kernel.sysrq to 1 as follows:
# sysctl -w kernel.sysrq=1
After that, the following command can be run on the compute node to crash the node:
# echo c > /proc/sysrq-trigger
After the command is run, the compute node OS crashes, kexec loads the crash kernel, the dump is taken, and the compute node reboots. Once the node is back up, the dump files should be found under the /var/crash directory of the compute node.
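To locate the most recent dump without browsing the directory by hand, a small helper along the following lines could be used. This is a sketch under an assumption: that each dump is written to its own subdirectory of /var/crash, and that the subdirectory names are timestamps that sort in chronological order. The function name latest_crash_dir is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the last subdirectory (in sorted glob order)
# of the crash directory. Assumes timestamp-named subdirectories, so that
# the last one in sorted order is also the newest dump.
latest_crash_dir() {
    crash_root="${1:-/var/crash}"
    latest=""
    for d in "$crash_root"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$d" ] || continue
        latest="$d"
    done
    if [ -z "$latest" ]; then
        return 1
    fi
    # Strip the trailing slash left by the glob pattern.
    printf '%s\n' "${latest%/}"
}

# Usage on the compute node:
# dir=$(latest_crash_dir) && ls -l "$dir"
```

The function returns a non-zero status when no subdirectory exists, which makes a missing dump easy to detect in a script.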