Monitoring A Service For Effective Problem Solving
Hello reader, today let’s think about effective monitoring of a service once it’s deployed in production. Most engineers will have experienced the pain of trying to figure out what is going on with limited or irrelevant monitoring. Based on my experience debugging issues in running services, I believe the following is a list of things a service should have in place before it goes to production, so that any issues that come up can be debugged effectively.
Log When Operation Is Accepted
Log an “Operation accepted” or similar message at the earliest opportunity when there is enough information to uniquely identify the operation.
If this log is missing while other information shows the operation arrived (e.g. the operation is logged by the load balancer or API gateway but not in the service), that’s a strong sign the service is overloaded and unable to accept new operations.
Without this log it will be hard to tell the difference between an operation that’s stuck partway through processing and one that never arrived at the service.
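As a rough sketch of what that first log might look like (the logger name, request shape, and field names here are my own assumptions, not a required schema):

```python
import logging
import uuid

logger = logging.getLogger("orders-service")  # hypothetical service name


def handle_request(request, process_operation):
    # Assumption: the caller supplies a request id we can reuse; otherwise we
    # mint one so the operation is uniquely identifiable from the very start.
    operation_id = request.headers.get("X-Request-Id") or str(uuid.uuid4())

    # Log acceptance as early as possible, before any real work happens.
    logger.info(
        "Operation accepted",
        extra={"operation_id": operation_id, "path": request.path},
    )

    # The actual work is passed in for this sketch.
    return process_operation(request, operation_id)
```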
Log After Operation Completes
Most services will log a message when an operation completes. How useful that log is depends on the metadata attached to it. I advocate including as much metadata as possible in the final log, e.g.
- Duration
- Start time
- Records processed
- etc…
Including as much metadata as possible in the final log line makes querying the logs a lot easier, especially if you want to create dashboards and metrics from the logs. It also makes comparing different operations much easier.
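A minimal sketch of such a final log line, with the metadata packed into one structured (JSON) payload so it stays easy to query; the field names are illustrative only:

```python
import json
import logging
import time

logger = logging.getLogger("orders-service")  # hypothetical service name


def log_operation_completed(operation_id, start_time, records_processed):
    # One metadata-rich final line per operation: easy to query, and easy to
    # turn into dashboards and metrics later.
    logger.info("Operation completed %s", json.dumps({
        "operation_id": operation_id,
        "start_time": start_time,
        "duration_seconds": time.time() - start_time,
        "records_processed": records_processed,
    }))
```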
Avoid Logging During The Operation
Often engineers will add logging to report the progress or state of an operation. This usually means that the logs are full of messages that do not contribute much to diagnosing an issue, and those messages have a real cost in terms of storage and query times.
Instead, I think it’s best to update a context object as the operation proceeds and then log the whole context as part of the final log after the operation completes. For very long-running operations the partial context can be logged at regular intervals to indicate progress, with the interval chosen so that only the slowest operations get these progress logs.
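A sketch of that pattern, assuming a plain dict as the context object and an arbitrary 30-second threshold for the progress logs:

```python
import logging
import time

logger = logging.getLogger("orders-service")  # hypothetical service name
PROGRESS_INTERVAL_SECONDS = 30  # assumption: only the slowest operations hit this


def process_records(operation_id, records, handle_record):
    # The context accumulates state instead of emitting a log line per step.
    context = {"operation_id": operation_id, "records_processed": 0, "skipped": 0}
    start = last_progress = time.time()

    for record in records:
        if handle_record(record):  # the actual per-record work, passed in
            context["records_processed"] += 1
        else:
            context["skipped"] += 1

        # Only very long-running operations ever reach this branch.
        if time.time() - last_progress > PROGRESS_INTERVAL_SECONDS:
            logger.info("Operation in progress", extra=dict(context))
            last_progress = time.time()

    context["duration_seconds"] = time.time() - start
    logger.info("Operation completed", extra=dict(context))
```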
Include Resolution Steps Where Possible
Sometimes it’s necessary to log a warning or error to surface an issue that needs attention. Examples might be “auto migration to new table failed” or “found two records when one was expected, ignoring second”. Include the resolution, or a link to where the resolution can be found, so that it’s available while the issue is being diagnosed.
This is especially important if an exception is explicitly caught and logged. The engineer looking at the log will most likely not be that familiar with the service, and if resolution steps are part of the exception log it will be much easier for them to address the issue at hand.
I would also argue that if no resolution is available then the log will just end up being noise; it can instead be metadata added to the context and logged after the operation is complete.
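For example, a caught exception logged together with its resolution; the exception type, job name, and runbook URL below are placeholders:

```python
import logging

logger = logging.getLogger("orders-service")  # hypothetical service name


class MigrationError(Exception):
    """Placeholder for whatever the migration layer actually raises."""


def migrate_record(record_id, auto_migrate):
    try:
        auto_migrate(record_id)  # the actual migration step, passed in
    except MigrationError as exc:
        # The engineer reading this log probably doesn't know the service well,
        # so the resolution (or a link to it) travels with the error itself.
        logger.warning(
            "Auto migration to new table failed for record %s: %s. "
            "Resolution: re-run the nightly backfill job, or see "
            "https://example.com/runbooks/auto-migration",  # placeholder URL
            record_id,
            exc,
        )
```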
Include Some Sort Of Restart/Crash Resilience
When a service is running happily we generally don’t care too much about the logs. However, if the service crashes or is restarted for any reason, knowing what the service was up to before the event can be very helpful in figuring out the root cause of the crash or restart. If logs are missing because the service was not able to perform a logging operation, figuring things out becomes a lot harder.
Increasing the frequency of logging certainly helps with crash/restart scenarios, but it means that under normal happy operation the logs will be full of information that is in general not very useful. Instead, a good alternative is for every operation to save progress information into a persistent local file (SQLite is good for this). During normal operation the file can be trimmed to ensure that it doesn’t grow too large, but if the service restarts or crashes then the file holds a record of what was happening just before the event.
Access to this file can be via pulling it directly off the service through some API, or the service could dump the contents of the file as part of its startup procedure.
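A rough sketch of that idea using the standard-library sqlite3 module; the table layout, trimming policy, and startup dump are assumptions about one possible shape, not a prescription:

```python
import json
import logging
import sqlite3
import time

logger = logging.getLogger("orders-service")  # hypothetical service name
PROGRESS_DB = "operation_progress.db"         # assumed local file path


def open_progress_db():
    db = sqlite3.connect(PROGRESS_DB)
    db.execute(
        "CREATE TABLE IF NOT EXISTS progress "
        "(operation_id TEXT PRIMARY KEY, updated_at REAL, context TEXT)"
    )
    return db


def save_progress(db, operation_id, context):
    # Persist the current context so a crash or restart doesn't lose it.
    db.execute(
        "INSERT OR REPLACE INTO progress VALUES (?, ?, ?)",
        (operation_id, time.time(), json.dumps(context)),
    )
    db.commit()


def trim_progress(db, keep=1000):
    # During normal operation, stop the file growing without bound.
    db.execute(
        "DELETE FROM progress WHERE operation_id NOT IN "
        "(SELECT operation_id FROM progress ORDER BY updated_at DESC LIMIT ?)",
        (keep,),
    )
    db.commit()


def dump_progress_on_startup(db):
    # As part of startup, log whatever the previous run left behind.
    for operation_id, updated_at, context in db.execute(
        "SELECT operation_id, updated_at, context FROM progress"
    ):
        logger.info("Recovered progress: %s at %s: %s", operation_id, updated_at, context)
```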
I’ve spent many hours grappling with too little, or too much, information in the logs when trying to figure out issues in production. I’ve thought a lot about how to improve the monitoring situation to make my life easier, and the points written down above are some of the results. Hopefully this also gives you, dear reader, some ideas on how to improve the monitoring of any services that you are responsible for!