Why Every Sitecore Implementation Needs Distributed Tracing
Thoughts from a software engineer who has spent 10+ years architecting, building, enhancing, and upgrading Sitecore implementations
The Problem
An error is happening in production. The marketing team has just launched a major email campaign, and the new landing page is displaying an error instead of the new call-to-action component. Everything looked great in the development, integration, and pre-production environments. Even the production environment passed through QA and was signed off on before the marketing campaign went live. What happened?
At this point, engineers are combing through production logs to try to identify what's happening and determine a root cause. A stand-alone Sitecore XM environment has, at a minimum, one Content Management instance, two Content Delivery instances, a SQL cluster, Redis for caching, Identity Server instances for processing logins, Solr instances for search, and so on. Add a lot more if you're running an XP version of Sitecore. That's a lot of server logs to review during an incident in progress.
Having a log aggregation service like Splunk can help, but you're often still searching for a needle in a haystack, and that haystack can contain thousands of log entries per second. And try not to think about what Splunk costs to retain all of those logs for any meaningful length of time.
The other challenge is that an issue anywhere in the enterprise that feeds content or data to Sitecore could be the real culprit behind the production problem. Frequently, it's an issue in another team's microservice that propagates to the website, usually something like a user profile service or an order history service having problems. It doesn't even have to be a complete outage of those services; increased latency in requests to them can trigger cascading issues. The fix could be as simple as notifying the relevant team about the urgency of scaling their service (microservices can be multiple tiers deep, and that team may not know where their data is being used or how critical it is to your customer-facing application).
So how do we identify the root cause of a production issue quickly?
Warnings, Errors, and Log Levels
Server logs grow over time (obviously), and the rate at which servers create logs also tends to increase over time. Pretty much every service supports multiple log levels. A log level controls how much detail is recorded for each action the service takes. The most common log levels are Error, Warn, Info, Debug, and Trace, and each successive level adds significantly more log data than the one before it.
The Error level logs only actual errors that occurred directly in the service itself. At the other extreme, the Trace level logs essentially everything that happens on every request. Services usually start out at either the Warn or the Info level, but it's not unusual for the level to drift toward Debug or even Trace as hard-to-diagnose issues crop up over time. This tends to happen to different services at different times.
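A minimal sketch using Python's standard logging module makes the trade-off concrete; the logger name and messages are hypothetical:

```python
import logging

# A logger only emits records at or above its configured level, so the level
# directly controls log volume. WARNING keeps the output small; dropping the
# level to DEBUG would emit every line below, on every request.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("content.delivery")

log.debug("Rendering hero component for item /sitecore/content/Home")  # suppressed
log.info("Request served in 120 ms")                                   # suppressed
log.warning("Profile service call took 2.5 s")                         # emitted
log.error("Profile service call timed out after 5 s")                  # emitted
```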
Log level changes like these are similar to technical debt: everyone knows a server shouldn't be left at the Debug or Trace level, but it's often out of sight and out of mind. The extra detail can even be helpful for a while, since it speeds up diagnosing issues. Over time, though, the data keeps growing, and it becomes much harder to find the signal in the noise of all that log data.
The Solution
The solution is to implement distributed tracing. Distributed tracing is a technique for monitoring and analyzing requests as they travel across a distributed system (such as a microservices architecture). It helps track the flow of requests between different services, identify performance bottlenecks, diagnose errors, and understand system behavior. It is the key that ties together the logs from all of the different services.
Each request into the system is assigned a unique trace ID. As the request moves through a service, span IDs are added to identify individual operations within that service. When one microservice calls another, it includes the trace context in the request headers. This lets troubleshooters quickly see that the Warn log on the Content Delivery server is directly tied to a timeout Error log in the User Profile microservice.
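As a rough, non-Sitecore-specific illustration, here is a minimal sketch using the OpenTelemetry Python SDK; the service name, span name, and profile-service URL are hypothetical:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Every span created under this provider carries the trace ID assigned to the
# incoming request; the exporter is swappable (Jaeger, Zipkin, an APM vendor).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("content-delivery")

with tracer.start_as_current_span("render-landing-page"):
    headers = {}
    inject(headers)  # copies the W3C trace context ("traceparent") into the headers
    # Hypothetical downstream call: the user profile service can now log and
    # trace its work under the same trace ID as the Content Delivery request.
    requests.get("https://profile-service.internal/users/123", headers=headers, timeout=2)
```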
This approach does require buy-in throughout the organization, as each application and service needs to be updated to support distributed tracing. The effort is typically fairly low, though, as the libraries that implement distributed tracing hook in at a low level and often require minimal configuration out of the box.
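For instance (a hedged sketch, assuming the opentelemetry-instrumentation-requests package is installed alongside the SDK), OpenTelemetry's auto-instrumentation can patch an HTTP client library with a single call, so trace context propagation happens without touching application code:

```python
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# After this one call, every outgoing call made with the requests library
# automatically starts a client span and attaches the traceparent header.
RequestsInstrumentor().instrument()
```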
Common Distributed Tracing Libraries and Tools
Here are some popular distributed tracing libraries and frameworks that you can implement in your system:
OpenTelemetry - An observability framework that merged OpenTracing and OpenCensus; provides language-specific SDKs for various programming environments
Jaeger - An open-source, end-to-end distributed tracing system that's compatible with OpenTelemetry and the OpenTracing standard
Zipkin - One of the earliest open-source distributed tracing systems, inspired by Google's Dapper paper
AWS X-Ray - AWS's distributed tracing service that helps analyze and debug distributed applications
Azure Application Insights - Microsoft's APM service that includes distributed tracing capabilities
Datadog APM - A commercial APM solution with strong distributed tracing features
New Relic Distributed Tracing - Part of New Relic's observability platform
Elastic APM - Distributed tracing integrated with the Elastic Stack
Dynatrace PurePath - Dynatrace's distributed tracing technology
Honeycomb - A commercial observability tool with powerful distributed tracing capabilities
Most of these tools follow the W3C Trace Context standard, which defines a common format for propagating trace context between services. This makes them relatively interoperable and easier to adopt.
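For reference, a traceparent header as defined by W3C Trace Context carries four hyphen-separated fields: a version, a 16-byte trace ID, the 8-byte ID of the calling span, and trace flags. A representative header (illustrative values) looks like this:

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```

The downstream service continues the same trace by reusing the trace ID and recording the caller's span as its parent, which is what lets logs from every tier line up under a single request.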