Service health metrics and alarms 🚨 — the immune system for web applications
What is the role of the immune system in the human body? It detects anomalies and alerts the body, and it helps mitigate issues to a great extent. Is there a similar mechanism to detect software anomalies and take corrective actions? Yes: that is the job of service health metrics and alarms.
Why would software services run into issues?
There are a myriad of reasons why software systems could run into issues. In software development, these issues are called errors: any deviation from the expected behaviour of the software. These errors can be grouped into two broad categories — client errors and server errors.
Client errors
These are issues that arise when the client does something the server does not expect in the normal business flow. It is the responsibility of the service to handle the most common client errors gracefully and respond accordingly. For example, a user providing wrong credentials while trying to log in would be handled by the server with an appropriate response, something like “Incorrect username/password provided.” Another example is a user trying to access a page (or a resource) that does not exist. Client errors are more common, but often less serious.
Server errors
A server error is when the client’s request seems alright but the server still couldn’t complete it. These could be due to bugs in the software itself, errors in dependent services, or (rarely) the environment that runs the software. One of the most common server errors is a NullPointerException, which occurs when the service expected something to have a value, but it didn’t. In an ideal scenario, the service should never run into server-side errors, and hence server errors are treated with much more seriousness.
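The two categories above can be made concrete with a minimal sketch. This is a hypothetical login handler (the names `ClientError`, `login` and `handle_login` are illustrative, not from the article): an invalid credential maps to a 4xx response, while an unexpected failure inside the server — here, the Python analogue of a NullPointerException — falls through to a 5xx.

```python
class ClientError(Exception):
    """The request itself was invalid; maps to a 4xx response."""
    def __init__(self, message, status=400):
        super().__init__(message)
        self.status = status

def login(users, username, password):
    user = users.get(username)
    if user is None or user["password"] != password:
        raise ClientError("Incorrect username/password provided.", status=401)
    return user

def handle_login(users, username, password):
    try:
        login(users, username, password)
        return 200, "Welcome, " + username
    except ClientError as e:
        return e.status, str(e)              # client error -> 4xx
    except Exception:
        return 500, "Internal server error"  # server error -> 5xx
```

Passing `users=None` makes `users.get(...)` raise an `AttributeError` — the “expected something to have a value, but it didn’t” case — which the handler reports as a server error.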
Why is early detection critical?
Software systems run into issues quite often, just like the human body. Most of the time the issues are trivial and might not damage the business. However, there are times when early detection of an issue is critical to keeping business operations running normally, just as early detection is critical for several major illnesses in humans. One such scenario is when a recent change breaks a previously working use-case, which could cause revenue loss or user inconvenience.
How can software be monitored for issues?
This is where metrics come in handy. More often than not, services have appropriate handlers for different hierarchies of errors. The trick is to record every instance of such a handler getting invoked, usually by emitting a metric to a metric recorder. You can think of this as calling an API of a metrics collection service (often externally hosted). Popular metrics collection offerings include AWS CloudWatch, Splunk, DataDog etc. Alternatively, service logs can be parsed to emit the metrics, either through capabilities offered by the log services (e.g. AWS CloudWatch Logs Insights) or by periodically processing the log files.
One important aspect of enabling metrics is the granularity at which the errors/warnings are recorded. A single metric for all the errors — client errors and server errors combined — might not be very useful. Rather, having granular metrics for each error type, even within these two broad categories, would provide better insights into the service health and help configure the alarms better.
When using cloud offerings, most cloud providers have metrics in place for service/resource utilisation and errors out of the box. These can also be monitored to evaluate service health.
What are health alarms?
Once we have the right metrics emitted, the next step is to define alarms on them. For example, the development team or the on-call might need to be alerted if there are server errors, or if resource utilisation breaches a certain threshold. Most metrics providers, like AWS CloudWatch, let you define alarms on the metrics. Often there are multiple alarms of varying severity defined on the same metric — for example, a low-severity alarm can be triggered when the memory utilisation reaches 60%, but a high-severity alarm might be more appropriate when it reaches 90%.
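The multi-severity idea can be sketched in a few lines. This is not any provider’s API, just an illustration using the example thresholds from the text (60% for low severity, 90% for high):

```python
def alarm_severity(utilisation_pct):
    """Return the severity of the alarm breached, if any.
    Thresholds are the illustrative 60%/90% values from the text."""
    if utilisation_pct >= 90:
        return "HIGH"
    if utilisation_pct >= 60:
        return "LOW"
    return None
```

In a provider like CloudWatch you would configure two separate alarms on the same metric rather than write this logic yourself, but the effect is the same: the severity escalates as the breach worsens.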
There can also be computed/calculated metrics derived from the raw metrics. For example, a failure rate derived from success and failure counts is often a metric of interest, with an alarm set for when it exceeds an acceptable threshold.
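A failure-rate computation is simple enough to show directly; the 5% threshold below is a hypothetical example, not a recommendation:

```python
def failure_rate(success_count, failure_count):
    """Computed metric: fraction of requests that failed."""
    total = success_count + failure_count
    return failure_count / total if total else 0.0

def failure_rate_alarm(success_count, failure_count, threshold=0.05):
    """True when the failure rate exceeds the (hypothetical) 5% threshold."""
    return failure_rate(success_count, failure_count) > threshold
```

Most providers support this kind of derived metric natively (e.g. metric math in CloudWatch), so the division happens on the monitoring side rather than in your service.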
Should you set alarms on the client errors as well?
Yes, monitoring client error metrics and setting proper alarms on them is equally important. The thresholds defined for client errors are usually much higher: you wouldn’t want to get alerted when a random user enters a wrong password once, but it would be good to know when a significant percentage of login attempts are failing. Client errors can also sometimes mask server-side issues/bugs. For example, an inadvertent change in the password hashing could cause legitimate login attempts to fail with a client error.
What happens when the service is in alarm state?
In the event of an alarm being triggered, the monitoring system could send emails to the team, or page the on-call/support person in charge. In more advanced systems with proper DevOps pipelines in place, these alarms could also trigger deployment rollbacks, auto-scaling, etc.
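One way to picture the dispatch is as a mapping from severity to follow-up actions. This is a hypothetical sketch (the action names are illustrative), mirroring the ticket/page behaviour described in this article:

```python
def alarm_actions(severity):
    """Hypothetical mapping from alarm severity to follow-up actions;
    real systems configure these per alarm in the monitoring provider."""
    mapping = {
        "HIGH": ["create_ticket", "page_on_call"],
        "LOW": ["create_ticket"],
    }
    return mapping.get(severity, [])
```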
Conclusion
By emitting the relevant metrics and configuring appropriate alarms, the owning teams can get notified well before an issue causes a major business impact or an outage.
At Amazon, we use service metrics and alarms extensively, and it is a default expectation that developers add the relevant metrics at development time itself. All services are monitored 24x7 through these metrics and the alarms configured on them. When a high-severity alarm is triggered, it creates a ticket and pages the on-call person for the development team that owns the service, so that they can look into the issue immediately. For low-severity alarms, the development team is notified through a ticket, but no paging is involved.