A cluster can sometimes show general weirdness, throwing up odd non-specific errors.
A checklist of suggestions to go through first:
1. check the hardware
Burn tests such as memcheck can detect certain hardware errors. While good cluster vendors run burn tests before handing over the cluster hardware, good hardware can go bad with time. The burn test suites provided with Bright Cluster Manager (documented in the Administrator Manual appendix) allow extensive planned burn tests to be carried out.
It cannot catch all hardware issues (flaky power or grounding, marginal SMPS voltages), but any errors seen during the pre-install section of a burn run mean there is some kind of hardware issue. With Linux being so stable, the post-install section of a burn run also has a good chance of detecting a hardware issue rather than stumbling over a software one.
2. check the software
(i) Any clues in the logs? Check /var/log/messages, /var/log/cmdaemon, the event logs, and console messages. The logs often warn of impending issues before the problems actually show up.
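A quick first pass over the logs can be scripted. Something like the following scans for recent warning and error lines; the log locations and patterns vary per distribution and setup, so the list here is only illustrative:
#!/bin/bash
# Scan the usual logs for error/warning/failure lines; show the last 20 of each.
# The list of log files here is only an example and varies per setup.
for log in /var/log/messages /var/log/cmdaemon; do
    [ -r "$log" ] || continue
    echo "=== $log ==="
    grep -iE 'error|warn|fail' "$log" | tail -n 20
done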
(ii) Are any of the partitions full? /var in particular can fill up with logs or mail. Running df -H and checking the Use% column (100% is bad) shows this. A full partition leads to a variety of errors, typically followed by a freeze.
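A minimal check in the same style as the health script shown further down under (v) could look like this; the 95% threshold and the excluded filesystems are only example values:
#!/bin/bash
# Fail if any local filesystem is more than 95% full.
# A message written to file descriptor 3 becomes the check's informational message.
full=$(df -H -x tmpfs -x devtmpfs --output=pcent,target | awk 'NR>1 && int($1) > 95 {print $2}')
if [ -z "$full" ]; then
    echo PASS
else
    echo "filesystem(s) nearly full: $full" >&3
    echo FAIL
fi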
(iii) Database corrupt? If a partition gets too full, the database is especially vulnerable to losing data, and cmd will complain. Fixing the database, or recloning it, may then be needed. In order of increasingly drastic measures: the MySQL utilities can repair a slightly corrupted database; for failover systems, the dbreclone option of the cmha utility can be used (Administrator Manual, High Availability chapter); the daily rotating-over-7-days backup can be used to restore the database (Administrator Manual, section on backups); or a restoration from a cluster backup may be carried out.
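For the mildest case, the standard MySQL utilities can be run against the databases. A sketch follows; the service name and credentials depend on the installation, so treat it as an outline rather than an exact recipe:
# Stop CMDaemon so that nothing writes to the database during the repair.
service cmd stop
# Check all databases and repair the ones that need it.
mysqlcheck -u root -p --auto-repair --all-databases
service cmd start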
(iv) Does the /tmp partition have the right permissions (1777)? The permissions for /tmp are set correctly during distribution installation. However, if the administrator moves /tmp elsewhere without ensuring that the permissions stay right, the cluster may manage to keep going while showing a variety of odd, confusing errors.
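Checking and, if needed, restoring the mode is straightforward (a sketch; adjust the path if /tmp has been relocated):
# The mode should be 1777 (drwxrwxrwt).
stat -c '%a %n' /tmp
# Restore the sticky, world-writable permissions if they are wrong.
chmod 1777 /tmp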
(v) Do the database files under /var/lib/ldap/__* have the right ownership (ldap:ldap)? The ownership of these files can get changed when the administrator moves or repopulates the databases. This can lead to odd symptoms in which LDAP may or may not start up; if it does start up, it has no users, which in turn causes further odd behavior that takes a while to trace back to its origin. A health script/action combination to check for this could be:
#!/bin/bash
# Health check: the LDAP database files under /var/lib/ldap must be owned by ldap:ldap.
# A message written to file descriptor 3 is passed back as the check's informational message.
if [ $(find /var/lib/ldap -maxdepth 1 -user "root" -name "__db.*" | wc -l) -eq 0 ]; then
    echo PASS
else
    # Fix up any __db.* files that have ended up owned by root.
    for i in $(find /var/lib/ldap -maxdepth 1 -user "root" -name "__db.*"); do
        chown ldap:ldap "$i"
    done
    echo "ldap database ownership corrected" >&3
    echo FAIL
fi
This can be modified and used for some of the other issues too, of course.
(vi) Other odd ownership/permission issues? When juggling LDAP, for example during upgrades, be careful if any of the LDAP users are migrated to or from the system password files (/etc/passwd). Some file trees may need their ownerships fixed as a result.
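Files whose owner or group no longer maps to any known user after such a migration can be located with find. For example (the paths searched here are only illustrative):
# List files whose owner or group has no matching entry in passwd/group/LDAP.
find /home /var/spool -xdev \( -nouser -o -nogroup \) -ls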