I learned the importance of observability the hard way, when I was tasked with debugging a major issue in an application that was not observable. The app was failing at scale, but there were no logs, metrics, traces, or errors to help me understand why. To make matters worse, the issue was not easily reproducible. This pushed me to learn more about Datadog and how to use it to make software observable.
The first step was to upgrade Datadog within the application. Datadog was technically running on the app, but it logged nothing useful, and the few logs that did show up carried no tags. I upgraded Datadog to the latest version in the application and also got it running in my local environment in Docker.
Next, I normalized the logging format in the application to make the logs consistent and informative. Including standard key-value pairs such as customer IDs, class names, and error messages let me see how the app was behaving at every step in the request cycle. Adding tags to all of the traces also let me quickly identify where an issue occurred.
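As a rough illustration of what a normalized format can look like, here is a minimal Python sketch using only the standard library. It is not the application's actual code: the `JsonFormatter` class, the `app.requests` logger name, and the field names `customer_id`, `class_name`, and `error_message` are all assumptions chosen to mirror the key-value pairs mentioned above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with the same keys."""

    # Hypothetical standard fields; the names are illustrative only.
    STANDARD_FIELDS = ("customer_id", "class_name", "error_message")

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.STANDARD_FIELDS:
            # Values passed through `extra=` become attributes on the record.
            # Missing fields are emitted as null so the shape stays constant.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(stream):
    """Return a logger that writes normalized JSON lines to `stream`."""
    logger = logging.getLogger("app.requests")  # assumed logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: log one failure with the standard fields attached.
buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "charge failed",
    extra={
        "customer_id": "c-123",
        "class_name": "BillingService",
        "error_message": "card declined",
    },
)
```

Because every line has the same keys, a log pipeline can filter or group by any of them without custom parsing for each message.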
Metrics matter here because indexing every log that passes through the app is expensive. With metrics, we can see the number of requests and errors at any point in the request cycle without indexing all of the logs.
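The post does not say how these metrics were produced, and they may well have been generated from the logs inside Datadog. The sketch below shows a different, commonly used route instead: emitting a counter directly from code in the DogStatsD datagram format (`name:value|c|#tag:value,...`) over UDP. The metric name, the tag names, and both helper functions are hypothetical, and a real application would normally use an official client library rather than hand-built datagrams.

```python
import socket


def format_counter(name, value=1, tags=None):
    """Build a DogStatsD counter datagram: name:value|c|#key:val,key:val"""
    datagram = f"{name}:{value}|c"
    if tags:
        # Sort the tags so the output is deterministic.
        tag_list = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        datagram += f"|#{tag_list}"
    return datagram


def send_counter(name, value=1, tags=None, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local agent.

    Assumes an agent listening on 8125, the default DogStatsD port.
    """
    payload = format_counter(name, value, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Usage: count one error at a hypothetical "checkout" step.
example = format_counter(
    "app.request.errors", tags={"step": "checkout", "env": "local"}
)
```

A counter like this costs a few bytes per event, which is why request and error counts can be kept for every step even when only a sample of the logs is indexed.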
The last step in making the app observable was to aggregate the logs and traces in a Datadog dashboard. I used a timeboard to visualize the sampled logs at every step in the request cycle, and displayed the error rates and metrics created from those logs alongside them. The dashboard has proven very valuable for debugging and monitoring issues reported in the app, and it has significantly reduced the time it takes to investigate and fix them.