
How do I tune/adjust monitoring?

Depending on the size, complexity, and hardware of your cluster, the default monitoring settings may be too aggressive. Below are several common steps you can take to make monitoring less aggressive if you are experiencing issues.

Option 1: Enable fuzzyoffset on the monitoring objects.

If you are seeing checks like ssh2node, ldap, and ntp flapping between failing and passing, you can try enabling fuzzyoffset, which adds an element of random timing to the checks.

For example:

# cmsh
% monitoring
% setup
% use ntp
% set fuzzyoffset 0.6
% commit

Think of the offset as a fraction that is multiplied by the sampling interval to set a maximum time offset for when sampling takes place. The actual offset used per node is spread reasonably evenly across the range up to that maximum.

For example, for a sampling time interval of 120s:

If the offset is 0, then there is no offset, and the sampling is attempted for all nodes at the exact time instant when the interval restarts. This can lead to an overload at the time of sampling.

If, on the other hand, the offset is 0.6, then sampling is offset from the start of the interval by a maximum of 0.6 × 120s = 72s. Once the new fuzzy offset value takes effect, each node is assigned its own offset, so the instant at which a node is sampled differs from the other nodes, even though each node still has an interval of 120s between samples.

An algorithm is used that tends to even out the spread of sampling instants within the range. Spreading the sampling in this way reduces the chance of an overload at sampling time.
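The spreading described above can be sketched in Python. This is a hypothetical illustration of the idea, not CMDaemon's actual code: each node derives a stable per-node offset within [0, fuzzyoffset × interval), so sampling instants are spread out instead of all coinciding at the start of the interval.

```python
import hashlib

def sample_offset(node_name: str, interval_s: float, fuzzy_offset: float) -> float:
    """Spread a node's sampling instant within [0, fuzzy_offset * interval_s).

    A stable hash of the node name yields an evenly distributed per-node
    fraction, so each node keeps the same offset across sampling rounds.
    """
    max_offset = fuzzy_offset * interval_s
    digest = hashlib.md5(node_name.encode()).hexdigest()
    fraction = int(digest, 16) / 16 ** len(digest)  # evenly spread in [0, 1)
    return fraction * max_offset

# With interval 120s and fuzzyoffset 0.6, every node samples within 72s of
# the start of each 120s interval, but at a node-specific instant.
offsets = [sample_offset(f"node{i:03d}", 120, 0.6) for i in range(1, 6)]
assert all(0 <= off < 0.6 * 120 for off in offsets)
```

With fuzzyoffset set to 0, every offset collapses to zero and all nodes sample at the same instant, which is exactly the overload scenario the setting avoids.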

Option 2: Increase the timeout for certain checks.

If a health check is flapping between an UNKNOWN state and PASS, this could be a timeout issue. Try increasing the health check timeout.

For example (this sets the timeout for ntp to 60 seconds; the default is 10 seconds):

# cmsh
% monitoring
% setup
% use ntp
% set timeout 60
% commit
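As a rough illustration of why a short timeout produces UNKNOWN rather than FAIL, a health check runner might behave like the following Python sketch. This is a hypothetical model, not CMDaemon's implementation: a check that is killed on timeout never reports a result, so its state is unknown.

```python
import subprocess

def run_health_check(cmd: list[str], timeout_s: float) -> str:
    """Run a check command and map its outcome to PASS / FAIL / UNKNOWN."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The check never finished, so its real state is unknown --
        # raising the timeout gives slow checks a chance to answer.
        return "UNKNOWN"
    return "PASS" if result.returncode == 0 else "FAIL"

# A slow check flaps to UNKNOWN under a tight timeout...
print(run_health_check(["sleep", "1"], timeout_s=0.2))  # UNKNOWN
# ...and passes once the timeout is raised above its runtime.
print(run_health_check(["sleep", "1"], timeout_s=5))    # PASS
```

This is why raising the timeout, as in the cmsh example above, can stop the UNKNOWN/PASS flapping: the check simply gets enough time to finish.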

Option 3: Increase the running interval.

Essentially, this runs the checks less often, which may reduce the load.

For example (this sets the interval to 5 minutes; the default is 2 minutes):

# cmsh
% monitoring
% setup
% use ntp
% set interval 5m
% commit

Option 4: Increase the ICMP timeout.

If the device state is flapping but other health checks pass, try increasing the ICMP timeout used internally by Bright. In /cm/local/apps/cmd/cmd.conf, add the following line (note: this should be done on both head nodes):

AdvancedConfig = { "ICMPPingTimeout=20000" }

After adding the above line on both head nodes, restart the cmd service:

# service cmd restart

Then monitor the cluster events to see whether the flapping improves.

Option 5: Increase failbeforedown.

If you are seeing systems flapping between an up and a down state with the message “flapping”, try increasing the number of failures allowed before a node is marked as down. The default is 1 or 3 failures, depending on the node type.

For example (this increases the number of failures to 5 before a node is marked as down):

# cmsh
% device
% open --failbeforedown 5 -n node001..node010

Or, to update all devices of type genericdevice at once:

# cmsh
% device
% open -t genericdevice --failbeforedown 5
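The effect of failbeforedown can be sketched as a consecutive-failure counter. This is a hypothetical illustration of the idea, not Bright's actual state machine: a node is only marked DOWN after N consecutive failed pings, so a single dropped ping no longer flips the state.

```python
class DeviceState:
    """Mark a device DOWN only after `fail_before_down` consecutive failures."""

    def __init__(self, fail_before_down: int = 1):
        self.fail_before_down = fail_before_down
        self.failures = 0
        self.state = "UP"

    def report(self, ping_ok: bool) -> str:
        if ping_ok:
            self.failures = 0          # any success resets the counter
            self.state = "UP"
        else:
            self.failures += 1
            if self.failures >= self.fail_before_down:
                self.state = "DOWN"
        return self.state

# With a threshold of 1, a single dropped ping flaps the node to DOWN:
flappy = DeviceState(fail_before_down=1)
print([flappy.report(ok) for ok in (True, False, True)])
# ['UP', 'DOWN', 'UP']

# With failbeforedown 5, short ping losses are tolerated:
steady = DeviceState(fail_before_down=5)
print([steady.report(ok) for ok in (True, False, False, True)])
# ['UP', 'UP', 'UP', 'UP']
```

Raising the threshold trades faster down-detection for stability: a genuinely dead node takes a few more sampling intervals to be marked DOWN, but transient packet loss no longer generates flapping events.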
Updated on August 23, 2021
