How do I tune/adjust monitoring?

There are several options, each of which works differently and addresses a different symptom.

Option 1:
Enable fuzzyoffset on the monitoring objects.
If checks such as ssh2node, ldap, and ntp are flapping between failed and passing, try enabling fuzzyoffset, which adds an element of random timing to the checks.

For example:
# cmsh
% monitoring
% setup
% use ntp
% set fuzzyoffset .60
% commit

Think of the offset as a factor that is multiplied by the sampling time interval to give the maximum time offset for when sampling takes place. The actual offset used for each node is spread reasonably evenly within the range up to that maximum.

For example, for a sampling time interval of 120s:

If the offset is 0, then there is no offset, and the sampling is attempted for all nodes at the exact time instant when the interval restarts. This can lead to an overload at the time of sampling.

If, on the other hand, the offset is 0.6, then the maximum time offset is 0.6 × 120s = 72s. So, each node is sampled at a time that is offset by up to 72s from when the 120s interval restarts. The offset for each node is set once the new fuzzyoffset value takes effect. From then on, the instant at which sampling is carried out differs from node to node, even though each node still has an interval of 120s between samples.

An algorithm is used that tends to even out the spread of the instants at which sampling is carried out within the range. The spreading of sampling has the effect of reducing the chance of overload at the time of sampling.
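
The arithmetic above can be sketched in Python. This is only an illustration: the function names max_offset and spread_offsets, and the simple evenly spaced scheme, are assumptions made for the example and are not the algorithm that the cluster manager actually uses.

```python
def max_offset(interval_s: float, fuzzyoffset: float) -> float:
    """Largest delay applied to a sample, in seconds."""
    return interval_s * fuzzyoffset


def spread_offsets(node_count: int, interval_s: float, fuzzyoffset: float) -> list:
    """Hypothetical even spread of per-node offsets within [0, max_offset]."""
    limit = max_offset(interval_s, fuzzyoffset)
    if node_count <= 1 or limit == 0:
        # With fuzzyoffset 0, every node is sampled at the same instant.
        return [0.0] * node_count
    step = limit / (node_count - 1)
    return [i * step for i in range(node_count)]


# Four nodes, 120s interval, fuzzyoffset 0.6: offsets range from 0s up to about 72s.
offsets = spread_offsets(node_count=4, interval_s=120, fuzzyoffset=0.6)
print(offsets)
```

Each node keeps its 120s interval; only the starting instant within the interval is shifted.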

Option 2:
Increase the timeout for certain checks.
If a health check is flapping between an Unknown state and Pass, this could be a timeout issue. Try increasing the health check timeout.

For example (this sets the timeout for ntp to 60 seconds; the default is 10 seconds):
# cmsh
% monitoring
% setup
% use ntp
% set timeout 60
% commit
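
To see why a timeout that is too short can show up as Unknown rather than as a failure, consider the following Python sketch. It is a simplified model and not the cluster manager's code: the function run_check, the stand-in slow check, and the mapping of a timed-out check to UNKNOWN are all assumptions made for illustration.

```python
import subprocess
import sys


def run_check(command, timeout_s):
    """Run a check command; report UNKNOWN if it does not finish in time."""
    try:
        result = subprocess.run(command, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # The check never returned a verdict, so its state cannot be known.
        return "UNKNOWN"
    return "PASS" if result.returncode == 0 else "FAIL"


# A stand-in check that takes about 1 second and then succeeds.
slow_check = [sys.executable, "-c", "import time; time.sleep(1)"]

print(run_check(slow_check, timeout_s=0.2))  # timeout shorter than the check's run time
print(run_check(slow_check, timeout_s=10))   # enough time for the check to finish
```

In this model, a check whose run time hovers around the timeout value would alternate between the two results, which matches the flapping symptom described above.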

Option 3:
Increase the running interval. This runs the checks less often, which may reduce the load.

For example (this sets the interval for ntp to 5 minutes; the default is 2 minutes):
# cmsh
% monitoring
% setup
% use ntp
% set interval 5m
% commit
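
As a rough estimate of the effect, the number of check runs per hour scales inversely with the interval. The Python sketch below assumes a hypothetical cluster of 500 nodes; the function name runs_per_hour is also just for illustration.

```python
def runs_per_hour(interval_s, node_count):
    """Total executions per hour of one check across all nodes."""
    return node_count * 3600 // interval_s


nodes = 500  # hypothetical cluster size

default_runs = runs_per_hour(120, nodes)  # 2-minute interval
reduced_runs = runs_per_hour(300, nodes)  # 5-minute interval

print(default_runs, reduced_runs)
print(f"reduction: {1 - reduced_runs / default_runs:.0%}")
```

Going from 2 minutes to 5 minutes cuts the number of runs of that check by 60%, at the cost of slower detection of a real failure.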

Option 4:
If the device state is flapping but other health checks pass, try increasing the ICMP timeout internally in Bright.
In /cm/local/apps/cmd/cmd.conf add this line (note: this should be done on both head nodes):
AdvancedConfig = { "ICMPPingTimeout=20000" }
After adding the above line on both head nodes, restart the cmd service.
# service cmd restart
Then monitor the cluster events to see whether the device state stops flapping.

Updated on April 28, 2021
