Project Spotlight: HP Smart Array S.M.A.R.T. Prometheus Exporter

Introduction

Like any homelabber, I love having Grafana dashboards displaying metrics about my hardware. Uptime, RAM utilization, CPU load, you name it. These metrics are usually displayed by querying a time-series database called Prometheus, which itself aggregates metrics from a variety of different sources.

One common type of Prometheus source is called an exporter: a program which observes the activity of a system or application and surfaces that activity over an HTTP server in a standardized format easily consumable by Prometheus. For example, the exporter used to collect metrics from a host computer is called the node_exporter.
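To make the idea of an exporter concrete, here is a minimal sketch in Go using only the standard library. It is not how node_exporter is implemented; the metric name demo_disk_temperature_celsius, the label, and the hard-coded reading are all made up for illustration. The handler writes a single gauge in the Prometheus text exposition format.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
)

// renderMetrics formats one gauge in the Prometheus text exposition format.
// The metric name and label are hypothetical, chosen only for this example.
func renderMetrics(disk string, celsius float64) string {
	var b strings.Builder
	b.WriteString("# HELP demo_disk_temperature_celsius Current disk temperature.\n")
	b.WriteString("# TYPE demo_disk_temperature_celsius gauge\n")
	fmt.Fprintf(&b, "demo_disk_temperature_celsius{disk=%q} %g\n", disk, celsius)
	return b.String()
}

// metricsHandler serves the rendered metrics, the way an exporter's
// /metrics endpoint would. The reading of 35 is a placeholder value.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	io.WriteString(w, renderMetrics("sda", 35))
}

func main() {
	// A real exporter would call http.ListenAndServe and run forever.
	// Here a throwaway local server is scraped once so the program exits.
	srv := httptest.NewServer(http.HandlerFunc(metricsHandler))
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/metrics")
	if err != nil {
		fmt.Fprintln(os.Stderr, "scrape failed:", err)
		return
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}
```

Prometheus would be pointed at the exporter's address and would fetch this text on every scrape interval. Real exporters generally use the official Prometheus client library rather than formatting lines by hand.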

Some of these metrics are very important to keep an eye on: for example, knowing if a machine is too hot based on its current CPU temperature, or if it’s about to run out of storage based on its disk utilization. When it comes to hard drives, it’s also very important to know if a drive is close to failure, so you can make preparations to replace it.

S.M.A.R.T. Tests

Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is a monitoring system built into most modern hard drives. It can be used to run tests against disks that will show signs of degradation or impending failure, while also collecting information about the disk’s lifetime usage, such as bytes read, total operational hours, and so on.

These test results are a strong, but not guaranteed, early warning sign of drive failure, and so naturally there is a Prometheus exporter which is used to surface S.M.A.R.T. data as metrics: the smartctl_exporter, written in Go. This is a very effective tool when targeting disks that are directly accessible to the operating system. Where it falls short, however, is in hardware RAID arrays.

HP Smart Array

Now you understand why I insisted on using periods in the acronym S.M.A.R.T. It’s so you won’t get it confused with Smart Arrays. :)

Hewlett Packard Smart Array Controllers are a type of hardware RAID controller, usually found in HP ProLiant servers. I won’t go too far into the weeds of RAID arrays here, so suffice it to say that when using a hardware RAID array, the individual disks of the array are opaque to the operating system, appearing instead as a single very large physical disk.

These individual disks can be interacted with using a special set of utilities: in this case, a CLI program called hpssacli. smartctl, the CLI program used to run S.M.A.R.T. tests manually, also has support for running tests against an array’s individual disks, so long as the correct drivers are installed and the command is formulated correctly.

The Project

The existing smartctl_exporter does not support running tests against the individual disks of a hardware RAID array. This is really no fault of the exporter itself. Recall that although smartctl has the ability to run such tests, it requires a special format of command. Here’s what that looks like:

# Against a normal disk
smartctl -a /dev/sda

# Against the first disk in a RAID array:
smartctl -a -d cciss,0 /dev/sg0

The second command takes a path to a device, as well as the -d cciss,0 piece. The 0 is the index of the physical disk in the RAID array, while the /dev/sg0 path is a way of referencing the RAID array logically using the SCSI Generic driver. To my knowledge, there is not a way of generically determining how many physical disks are located inside of a hardware RAID array controller. Some controllers have eight slots, some have only four, some have sixteen, and so on. The number of slots can usually only be identified using a utility specific to that hardware controller.

This makes the task of supporting S.M.A.R.T. metrics on hardware RAID arrays much more difficult for the smartctl_exporter project. They would have to check if the bespoke CLI utility for each supported model of RAID array was available on the platform, then go into special-case logic to invoke that utility to determine the number of individual disks in that controller. Hypothetically they could do something like attempting to run the tests against a range of disks sequentially until the command fails, then cache the number for later, but this is still rather inelegant.
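The "probe until it fails" idea can be sketched in Go roughly as follows. This is an illustration of the hypothetical approach, not code from smartctl_exporter. The function names are made up, the probe uses smartctl’s cciss,N device type with the -i (identify) flag, and treating any non-zero exit status as "no disk at this index" is a simplifying assumption, since smartctl’s exit status is a bitmask that can be non-zero for other reasons.

```go
package main

import (
	"fmt"
	"os/exec"
)

// probeFunc reports whether the physical disk at the given index responds.
type probeFunc func(index int) bool

// smartctlArgs builds the arguments for querying one disk behind a controller.
func smartctlArgs(index int, device string) []string {
	return []string{"-i", "-d", fmt.Sprintf("cciss,%d", index), device}
}

// smartctlProbe returns a probeFunc that shells out to smartctl.
// Assumption: a non-zero exit status means there is no disk at that index.
func smartctlProbe(device string) probeFunc {
	return func(index int) bool {
		return exec.Command("smartctl", smartctlArgs(index, device)...).Run() == nil
	}
}

// countDisks probes indexes 0, 1, 2, ... and stops at the first failure,
// or at limit so that a misbehaving probe cannot loop forever.
func countDisks(probe probeFunc, limit int) int {
	n := 0
	for n < limit && probe(n) {
		n++
	}
	return n
}

func main() {
	// A fake probe standing in for a controller with three populated bays.
	// Swap in smartctlProbe("/dev/sg0") to query real hardware.
	fake := func(index int) bool { return index < 3 }
	fmt.Println("disks found:", countDisks(fake, 16))
	fmt.Println("example arguments:", smartctlArgs(0, "/dev/sg0"))
}
```

The result would be worth caching, since every scrape would otherwise pay for one failed smartctl invocation per controller. A gap in the populated bays would also end the count early, which is part of why this approach is inelegant.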

What might work better is an entirely separate exporter dedicated to the RAID array controller it is intended to collect the metrics for.

A Partial Solution

Now admittedly, I am telling the story a bit backwards here. In my search for an easy way to get S.M.A.R.T. metrics from my HP Smart Array, I came across a repository called smartctl_ssacli_exporter. This looked like exactly what I wanted, but upon trying to build it, it failed to compile for want of a couple of small tweaks, so I forked the repository and fixed it up myself.

It was not until digging fairly deep into this repository that I realized it didn’t export much in the way of S.M.A.R.T. metrics at all. There was some information being surfaced about the array’s physical and logical disks that was collected from the hpssacli utility, but little in the way of S.M.A.R.T. data. In fact, it was investigating this lack of metrics that led me down the rabbit hole of fully understanding how smartctl interacts with a RAID array and why smartctl_exporter is unable to provide metrics for them.

Let’s go. In and out. 20 minute Go package update.

So my little fork of smartctl_ssacli_exporter became more or less a full rewrite. I kept some of the metrics surfaced from the hpssacli command (although I rewrote the parsing), then added logging, logic for determining how many RAID array controllers were on the system, and logic for running S.M.A.R.T. tests against each disk on each array. The actual metrics surfaced about the S.M.A.R.T. tests I mostly tore straight out of the code for smartctl_exporter.
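As a rough picture of what turning one disk’s S.M.A.R.T. data into metrics can look like, here is a sketch in Go. It is not the code from either exporter. It assumes a smartctl version that supports JSON output (the --json flag) and reads only three fields from it. The metric names, the label names, and the sample document are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// smartReport mirrors a small subset of smartctl's JSON output.
// Only the fields used below are declared; everything else is ignored.
type smartReport struct {
	SmartStatus struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
	PowerOnTime struct {
		Hours int `json:"hours"`
	} `json:"power_on_time"`
}

// parseReport decodes the JSON document printed by smartctl.
func parseReport(raw []byte) (smartReport, error) {
	var r smartReport
	err := json.Unmarshal(raw, &r)
	return r, err
}

// metricLines renders the report as Prometheus samples, labelled by the
// controller slot and the disk index. All metric names are hypothetical.
func metricLines(r smartReport, slot, disk int) []string {
	labels := fmt.Sprintf(`{slot="%d",disk="%d"}`, slot, disk)
	passed := 0
	if r.SmartStatus.Passed {
		passed = 1
	}
	return []string{
		fmt.Sprintf("demo_smart_passed%s %d", labels, passed),
		fmt.Sprintf("demo_smart_temperature_celsius%s %d", labels, r.Temperature.Current),
		fmt.Sprintf("demo_smart_power_on_hours%s %d", labels, r.PowerOnTime.Hours),
	}
}

func main() {
	// Hand-written sample input, not captured from a real drive.
	sample := []byte(`{"smart_status":{"passed":true},"temperature":{"current":34},"power_on_time":{"hours":20481}}`)
	report, err := parseReport(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, line := range metricLines(report, 0, 2) {
		fmt.Println(line)
	}
}
```

Labelling each sample with the controller and disk index is what lets a single exporter describe every disk on every array without the metric names colliding.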

Initially I waffled about whether or not to just open a pull request against smartctl_exporter and add the functionality there instead, but decided that the dependency upon the hpssacli utility that my implementation would introduce meant that it was best left as a dedicated project.

Preserving the Principle of Least Privilege

Not long after I finished coding the changes to smartctl_ssacli_exporter, I encountered another problem: the hpssacli utility must be run as root. Generally, it’s considered bad practice to allow utilities to have root privileges, because if there is a vulnerability in that utility that allows an attacker to gain control of the process, the attacker can wreak a lot of havoc. Linux has a variety of access controls that can be used to give processes or users more granular privileges to certain system resources without giving them full root privileges.

Unfortunately, none of these more granular privileges are sufficient for hpssacli. I spent a good bit of time looking at its system calls using strace to see if I could figure out a way around it, but my guess is that it just checks whether it is running under a hard-coded UID and quits if it is not. Here is a good article diving into further detail about the problem.

So, what’s the next best thing? Well, even if hpssacli absolutely has to run with root privileges, that doesn’t mean smartctl_ssacli_exporter does too. We can use a privilege escalation tool, such as sudo. Fortunately, sudo can be configured to only allow privilege escalation for certain commands or executables. This way the process that is running the exporter can be permitted to run only hpssacli as root, without being allowed to execute any other processes as root. While still not perfect, it means that compromising smartctl_ssacli_exporter on its own doesn’t buy an attacker very much.
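In Go, the sudo approach might look something like the sketch below. It is an illustration, not the exporter’s actual code. The install path /usr/sbin/hpssacli, the user name in the sudoers comment, and the helper name hpssacliCommand are all assumptions. The -n flag tells sudo to fail rather than prompt for a password, which is what an unattended service needs.

```go
package main

import (
	"fmt"
	"os/exec"
)

// hpssacliPath is an assumed install location; adjust it for the real system.
const hpssacliPath = "/usr/sbin/hpssacli"

// A sudoers rule along these lines (hypothetical user name) would allow the
// exporter's account to run only this one binary as root, with no password:
//
//	exporter ALL=(root) NOPASSWD: /usr/sbin/hpssacli

// hpssacliCommand wraps an hpssacli invocation in a non-interactive sudo call.
// The exporter process itself stays unprivileged.
func hpssacliCommand(args ...string) *exec.Cmd {
	full := append([]string{"-n", hpssacliPath}, args...)
	return exec.Command("sudo", full...)
}

func main() {
	// Build, but do not run, the command that lists the controllers.
	cmd := hpssacliCommand("ctrl", "all", "show")
	fmt.Println(cmd.Args)
}
```

Using the absolute path in both the sudoers rule and the command matters: sudo matches the rule against the full path, so a different binary that happens to be named hpssacli elsewhere on the PATH would not be covered by it.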

Conclusion

This project started out pretty simple and increased in size and complexity with each step. Still, it was a good opportunity to dabble in the Go programming language and learn a lot more about Linux access controls than I had ever understood before. You can find the finished product on my GitHub, here.