This is the fifth in a series of blog posts providing tips and tricks for answering common data center questions with the use of SPM. In my last post, I gave some guidance on answering, “How do I prove power system redundancy?” Though uptime is of prime concern, there will occasionally be outages at some level of the power chain which leave us scratching our heads.
Question: What happened to cause this outage?
Depending upon whom you ask, anywhere from 40% to 80% of outages are attributed to “human error”. Definitions of “outage” and “human error” vary widely, but the point stands: power outages caused by human error are presumed to be correctable and preventable.
- Before you worry about the details of what happened, you must first be notified that something with a negative impact occurred. SPM brings these alerts to you in the form of emails.
- Meaningful alerts start with naming the components and sub-components of a system appropriately, based on their connections with other equipment and their physical location or relevance. SPM allows access into the depths of the Server Technology CDU for convenient naming of infeed power connections, outlets to rack devices, and environmental sensors. Additionally, groupings of outlets and racks/infeeds allow for larger-scale organization of equipment.
- The alerts are, of course, based on thresholds that describe the in-bound / out-of-bound conditions of particular power and environmental measurements. These are best set with the culture of the organization in mind. That is, set them initially lax or tight depending on whether your organization would rather risk too few alerts or too many. Then, most importantly, review those threshold levels on a regular basis to keep your system in tune with the organizational goals.
- Finally, SPM will provide its alerting through email. Be sure to set these to send to the right personnel, and only the right personnel.
- Once you get that alert, it then becomes incumbent upon you to identify what led up to the condition of concern. SPM provides trending and reporting of all critical values for these purposes.
- Create meaningful trends within SPM to display the critical power and environmental parameters. The trending configurator allows for multiple point metrics, overlaying of time periods, and overlaying of different types of metrics such as temperature with power.
- Once an event has occurred, it can be quite useful to keep a close eye on that particular section of the data center or that particular equipment. SPM’s built-in, live-updating Views configurator allows each user, no matter their responsibility, to see what is happening in the data center while performing their everyday tasks.
- Continuing the proactive discussion, any trends and reports you build for understanding recent events can, in turn, be sent as a scheduled email to keep you informed of conditions that may be approaching an alert threshold again.
- Returning to the concept that most downtime in the data center is due to human error, an analysis of what happened leads to the questions “who did what?” and “what can we do better?”
- The first step to accountability and traceability is to assign appropriate user rights to personnel interacting with SPM. Authentication can be set up using LDAP(S) or TACACS+, as with most other network software. Additionally, set up user rights within SPM so that only the right people can make changes to the system or control CDU outlets or configurations.
- When an alert occurs that could be personnel related, be sure to check the SPM logs for event details and times.
- Finally, SPM can help avoid certain personnel errors by scheduling outlet control actions for maintenance. This can be particularly useful when you want to be sure that certain equipment is powered up at particular times.
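
The threshold bullet above describes in-bound and out-of-bound conditions and the choice between tight and lax settings. A minimal Python sketch of that idea follows. The `Bounds` class, the temperature limits, and the sample readings are illustrative assumptions only; none of them come from SPM or its configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical low/high limits for one measurement (not an SPM type)."""
    low: float
    high: float

    def in_bounds(self, value: float) -> bool:
        # A reading is in-bound when it sits between the limits, inclusive.
        return self.low <= value <= self.high


def out_of_bounds(readings: list[float], bounds: Bounds) -> list[float]:
    """Return the readings that would trigger an alert under these bounds."""
    return [r for r in readings if not bounds.in_bounds(r)]


# Example rack inlet temperatures in degrees Celsius (made-up values).
readings = [21.5, 28.0, 36.2]

# Two candidate settings: tight favors over-alerting, lax favors under-alerting.
tight = Bounds(low=18.0, high=27.0)
lax = Bounds(low=10.0, high=35.0)

tight_alerts = out_of_bounds(readings, tight)
lax_alerts = out_of_bounds(readings, lax)
```

With these sample values, the tight setting flags two readings and the lax setting flags one. Running the same comparison against your own trend data is one way to judge, during a regular threshold review, how many alerts a proposed setting would have produced.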
Though your data center may be designed to Tier IV, and your personnel and processes proactively optimized, downtime will happen at some point, at some level in the power chain. It then falls to the responsible personnel to find out “What happened?” to cause the outage. SPM helps find the answers. For more information on troubleshooting using SPM, contact our technical staff at firstname.lastname@example.org.