IMPORTANT! The scripts below are not supported by Dell, nor is their use on a Dell ECS node, so you follow these steps at your own risk. We run these scripts on Dell ECS nodes in production to collect hardware information and report it back periodically, and no issues have been identified so far.
At the time of writing, the Dell ECS platform does not appear to expose hardware fault information to end users via the Dell ECS Web GUI, REST API or Email/Syslog notifications, which limits your ability to see certain types of problem. These events/issues are, however, sent to Dell via ESRS, so Dell should have visibility of them even if you do not (directly).
The script(s) below should be placed on each node and set to run periodically. They check the hardware status, process the result and report it back to NagiosXI as a passive check using NRDP.
The script can be found here: https://github.com/tristanhself/general/blob/9ee1151e2617db036bbd5fcc4a26d401c3c92a1f/check_racadm_hardware.sh
The send_nrdp.sh script is supplied with NagiosXI, so I cannot distribute it here, but you can obtain it directly from NagiosXI.
The process requires you to deploy the script on each node, set up the cron job, then create a “Passive Check” within your NagiosXI configuration that is updated with the status posted from the node. It also uses freshness checking in the NagiosXI configuration, so that if a node fails to report in within a particular time you are alerted and can take action.
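The flow described above (check the hardware, turn the result into a Nagios state, post it with NRDP) might be sketched roughly as follows. This is not the linked check_racadm_hardware.sh script. The function names, the keyword matching, the URL and the token are all hypothetical placeholders, and the send_nrdp.sh flags shown (-u, -t, -H, -s, -S, -o) are taken from that client's usage text as I understand it, so check them against your own copy.

```shell
#!/bin/bash
# Hypothetical sketch only: not the author's script.

NRDP_URL="https://nagios.example.com/nrdp/"   # assumption: your NRDP endpoint
NRDP_TOKEN="changeme"                          # assumption: your NRDP token

# Translate free-text hardware status into a Nagios return code:
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN (no output at all).
status_to_state() {
    local text="$1"
    if [ -z "$text" ]; then
        echo 3
    elif echo "$text" | grep -qiE 'critical|fail'; then
        echo 2
    elif echo "$text" | grep -qiE 'warning|degraded'; then
        echo 1
    else
        echo 0
    fi
}

# Print (rather than run) the NRDP submission, so the sketch is safe to try.
build_nrdp_command() {
    local host="$1" service="$2" state="$3" output="$4"
    printf "/tmp/send_nrdp.sh -u '%s' -t '%s' -H '%s' -s '%s' -S %s -o '%s'\n" \
        "$NRDP_URL" "$NRDP_TOKEN" "$host" "$service" "$state" "$output"
}
```

A caller would capture the hardware status text, pass it to status_to_state, and then run the command that build_nrdp_command prints. Printing the command first makes it easy to review what would be sent before anything reaches NagiosXI.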
Step 1 – Put the Monitoring Script in Place
Log on to the node with SSH; this needs to be performed on each node you wish to monitor.
Put the check_racadm_hardware.sh file and the send_nrdp.sh file into the /tmp directory.
Ensure they are executable, running the following as root:
sudo -i chmod +x /tmp/check_racadm_hardware.sh
sudo -i chmod +x /tmp/send_nrdp.sh
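Before moving on, it may be worth confirming that both files really are in place and executable on every node. The helper below is a minimal sketch of such a check; the function name missing_scripts is hypothetical and is not part of either script.

```shell
#!/bin/bash
# Hypothetical helper: print each given path that is missing or not executable.
missing_scripts() {
    local f
    for f in "$@"; do
        [ -x "$f" ] || echo "$f"
    done
}

# Example usage (prints nothing when both scripts are ready):
#   missing_scripts /tmp/check_racadm_hardware.sh /tmp/send_nrdp.sh
```

An empty result means both scripts are ready. Any path printed needs copying again or another chmod.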
Step 2 – Set the Cron to Run Automatically
You also need to perform this step on each node you wish to monitor.
Create a file in the /etc/cron.d directory with the contents shown below; it runs the script every 12 hours, every day. It's recommended to stagger the start time by a few minutes across the nodes.
Create and edit with:
sudo -i vi /etc/cron.d/hardware_racadm
With the contents:
SHELL=/bin/bash
PATH=/sbin:/usr/sbin:/bin:/usr/bin
0 */12 * * * admin /tmp/check_racadm_hardware.sh
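One way to stagger the nodes without hand-picking a minute for each is to derive the minute field from the node's hostname. The sketch below does this with a checksum; the function names stagger_minute and cron_line are hypothetical, and hashing the hostname is my own suggestion rather than anything the original script does.

```shell
#!/bin/bash
# Hypothetical sketch: derive a stable minute (0-59) from a hostname so that
# different nodes run the check at different minutes past the hour.
stagger_minute() {
    local name="$1" sum
    # cksum prints "<crc> <bytes>"; keep only the CRC.
    sum=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
    echo $(( sum % 60 ))
}

# Build the cron.d entry for a given node name.
cron_line() {
    local name="$1"
    echo "$(stagger_minute "$name") */12 * * * admin /tmp/check_racadm_hardware.sh"
}

# Example usage on a node:
#   cron_line "$(hostname)"
```

Because the minute depends only on the hostname, re-running this on a node always produces the same entry. Two nodes can still land on the same minute, which is harmless for a check this light.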
Step 3 – Configure NagiosXI
Create a host for the ECS node in your NagiosXI configuration, with a passive check for the hardware status.
Then wait for the script to report in periodically. Ideally the checks run at staggered times across all the hosts and sites. It is also recommended to set the freshness check to at least twice the period of the check. For example, if the check sends every 12 hours (43,200 seconds), you should set the freshness check to 24 hours (86,400 seconds).
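The freshness arithmetic above can be written as a small helper, which is handy if you later change the reporting period. This is only an illustrative sketch: the function names are hypothetical, and NagiosXI does the real staleness evaluation itself.

```shell
#!/bin/bash
# Hypothetical helper: freshness threshold in seconds for a check that
# reports every <period_hours>, using a multiplier (default 2x the period).
freshness_threshold() {
    local period_hours="$1" multiplier="${2:-2}"
    echo $(( period_hours * 3600 * multiplier ))
}

# Succeeds (exit 0) if the last report is older than the threshold.
# Arguments: last report time and current time as epoch seconds, threshold in seconds.
is_stale() {
    local last="$1" now="$2" threshold="$3"
    [ $(( now - last )) -gt "$threshold" ]
}

# Example usage:
#   freshness_threshold 12      # 12-hour period, doubled
```

With a 12-hour period and the default multiplier this gives 86400, matching the 24-hour figure above.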