Debug the slowest trace on the slowest endpoint of a web service

3 minutes to complete

With Datadog APM, you can investigate the performance of your endpoints, identify slow requests, and find the root cause of latency issues. This example finds the slowest trace of the day for an e-commerce checkout endpoint and shows how high CPU usage slows it down.

  1. Open the Services page.

    This page lists all instrumented services available in Datadog APM. You can search for keywords, filter by env-tag, and set the timeline.

  2. Search for a relevant and active web service and open the Service Page.

    The web-store service is used in this example because it is the primary server in the tech stack and it controls most calls to third-party services.

    In addition to throughput, latency and error rate information, the Service Page contains a list of Resources (major operations like API endpoints, SQL queries, and web requests) identified for the service.

  3. Sort the Resource table by p99 latency and click into the slowest resource. Note: If you cannot see a p99 latency column, click the cog icon (Change Columns) and turn on the switch for p99.

    The Resource page contains high-level metrics about this resource like throughput, latency, error rate, and a breakdown of the time spent on each downstream service from the resource. In addition, it contains the specific traces that pass through the resource and an aggregate view of the spans that make up these traces.

  4. Set the time filter to 1d (One Day). Scroll down to the Traces table and sort it by duration, then hover over the top trace in the table and click View Trace.

    This is the Flamegraph and associated information. Here you can see the duration of each step in the trace and whether it is erroneous. This is useful in identifying slow components and error-prone ones. The Flamegraph can be zoomed, scrolled, and explored naturally. Under the Flamegraph you can see associated metadata, Logs, and Host information.

    The Flamegraph helps you pinpoint the piece of your stack that is erroneous or slow. Errors are marked with red highlights, and duration is represented by the horizontal length of the span, so the longest spans are the slowest ones. Learn more about using the Flamegraph in the Trace View guide.

    Under the Flamegraph you can see all of the tags (including custom ones). From here you can also see associated logs (if you connected Logs to your Traces) and Host-level information such as CPU and memory usage.

  5. Click into the Host tab and observe the CPU and memory performance of the underlying host while the request was hitting it.

  6. Click Open Host Dashboard to view all relevant data about the host.
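
Steps 3 and 4 amount to two rankings: first rank resources by p99 latency, then rank the traces of the slowest resource by duration. The sketch below illustrates that logic in plain Python on made-up data. It is not the Datadog API: the resource names, trace IDs, durations, and helper functions are all hypothetical, and it uses the nearest-rank percentile definition, which may differ from how Datadog computes p99.

```python
import math


def p99(durations_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    if not durations_ms:
        raise ValueError("need at least one duration")
    ordered = sorted(durations_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def slowest_resource(traces_by_resource):
    """Return the resource name with the highest p99 latency."""
    return max(
        traces_by_resource,
        key=lambda name: p99([t["duration_ms"] for t in traces_by_resource[name]]),
    )


def slowest_trace(traces):
    """Return the trace with the longest duration."""
    return max(traces, key=lambda t: t["duration_ms"])


# Hypothetical sample data, for illustration only.
traces_by_resource = {
    "GET /cart": [
        {"trace_id": "a1", "duration_ms": 40},
        {"trace_id": "a2", "duration_ms": 55},
    ],
    "POST /checkout": [
        {"trace_id": "b1", "duration_ms": 120},
        {"trace_id": "b2", "duration_ms": 2300},
        {"trace_id": "b3", "duration_ms": 180},
    ],
}

resource = slowest_resource(traces_by_resource)
trace = slowest_trace(traces_by_resource[resource])
print(resource, trace["trace_id"], trace["duration_ms"])
```

With this sample data, the checkout resource has the highest p99, and its longest trace is the one you would open with View Trace.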

Datadog APM integrates with your other Datadog data, such as infrastructure metrics and Logs. From the Flamegraph, you can reach this information along with any custom metadata you send with your traces.
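
To make the trace-to-host correlation concrete, the following sketch takes the longest span in a trace and averages the host CPU samples that fall inside its time window, which is roughly what you do by eye in the Host tab. Everything here is an assumption for illustration: the span names, the field names (`start`, `duration`, `ts`, `cpu_pct`), and the sample values are hypothetical and do not reflect a Datadog data format or API.

```python
def longest_span(spans):
    """Return the span with the greatest duration (the widest bar in a flamegraph)."""
    return max(spans, key=lambda s: s["duration"])


def average_cpu_during(samples, start, end):
    """Average CPU percentage of samples with start <= ts <= end, or None if there are none."""
    in_window = [s["cpu_pct"] for s in samples if start <= s["ts"] <= end]
    if not in_window:
        return None
    return sum(in_window) / len(in_window)


# Hypothetical spans from one trace; times are in seconds.
spans = [
    {"name": "web.request", "start": 100.0, "duration": 4.0},
    {"name": "payment.charge", "start": 100.5, "duration": 3.2},
    {"name": "postgres.query", "start": 100.6, "duration": 0.3},
]

# Hypothetical host CPU samples taken around the same time.
cpu_samples = [
    {"ts": 98.0, "cpu_pct": 35.0},
    {"ts": 101.0, "cpu_pct": 92.0},
    {"ts": 103.0, "cpu_pct": 96.0},
    {"ts": 106.0, "cpu_pct": 40.0},
]

root = longest_span(spans)
window_end = root["start"] + root["duration"]
avg_cpu = average_cpu_during(cpu_samples, root["start"], window_end)
print(root["name"], avg_cpu)
```

In this made-up data, CPU usage is much higher during the request window than before or after it, which is the kind of pattern that points to host CPU as a cause of the latency.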

Further Reading