Investigate Slow Traces or Endpoints

If your application is showing performance problems in production, integrating distributed tracing with code stack trace benchmarks from profiling is a powerful way to identify performance bottlenecks. Application processes that have both APM distributed tracing and the continuous profiler enabled are automatically linked.

You can move directly from span information to profiling data on the Code Hotspots tab and find specific lines of code related to performance issues. Similarly, you can debug slow and resource-consuming endpoints directly in the Profiling UI.

Identify code hotspots in slow traces

Prerequisites

Code Hotspots identification is enabled by default when you turn on profiling for your Java service. For manually instrumented code, the continuous profiler requires scope activation of spans:

import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

final Tracer tracer = GlobalTracer.get(); // obtain the tracer, for example the globally registered tracer
final Span span = tracer.buildSpan("ServicehandlerSpan").start();
try (final Scope scope = tracer.activateSpan(span)) { // mandatory for Datadog continuous profiler to link with span
    // worker thread implementation
} finally {
    // Finish the span when the work is complete
    span.finish();
}

Requires:

  • OpenJDK 11+ and dd-trace-java version 0.65.0+; or
  • OpenJDK 8: 8u282+ and dd-trace-java version 0.77.0+.
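For reference, profiling for a Java service is typically turned on through the tracer's startup configuration. The sketch below shows one common way to do this with JVM system properties; the agent path, service name, and environment are placeholders, and environment variables such as DD_PROFILING_ENABLED can be used instead.

# Example only: attach the Datadog Java agent and enable the continuous profiler.
# The jar path, service name, and environment are illustrative placeholders.
java -javaagent:/path/to/dd-java-agent.jar \
     -Ddd.profiling.enabled=true \
     -Ddd.service=my-web-service \
     -Ddd.env=production \
     -jar my-application.jar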

Code Hotspots identification is enabled by default when you turn on profiling for your Python service.

Requires dd-trace-py version 0.44.0+.

Code Hotspots identification is enabled by default when you turn on profiling for your Ruby service.

Requires dd-trace-rb version 0.49.0+.

Code Hotspots identification is enabled by default when you turn on profiling for your Go service.

Requires dd-trace-go version 1.37.0+.

Note: This feature works best with Go version 1.18 or newer. Go 1.17 and below have several bugs (see GH-35057, GH-48577, CL-369741, and CL-369983) that can reduce the accuracy of this feature, especially when using a lot of CGO.

Code Hotspots identification is enabled by default when you turn on profiling for your .NET service.

Requires dd-trace-dotnet version 2.7.0+.

Code Hotspots identification is enabled by default when you turn on profiling for your PHP service.

Requires dd-trace-php version 0.71+.

From the view of each trace, the Code Hotspots tab highlights profiling data scoped to the selected span.

The values on the left side are the time spent in that method call during the selected span. Depending on the runtime and language, this list of types varies:

  • Method durations shows the overall time taken by each method from your code.
  • CPU shows the time taken executing CPU tasks.
  • Synchronization shows the time spent waiting on monitors, the time a thread is sleeping and the time it is parked.
  • VM operations (Java only) shows the time taken waiting for VM operations that are not related to garbage collection (for example, heap dumps).
  • File I/O shows the time taken waiting for a disk read/write operation to execute.
  • Socket I/O shows the time taken waiting for a network read/write operation to execute.
  • Monitor enter shows the time a thread is blocked on a lock.
  • Uncategorized shows the time taken to execute the span that cannot be placed into one of the above categories.

Click the plus icon + to expand the stack trace to that method in reverse order. Hover over the value to see the percentage of time explained by category.
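As a rough illustration of how these categories map onto code, consider the hypothetical Java handler sketched below (the class, method, and path names are invented for this example): time spent blocked entering the synchronized block would appear under Monitor enter, the blocking file read under File I/O, and the parsing work under CPU.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReportHandler {
    private final Object cacheLock = new Object();

    byte[] handleRequest(String reportId) throws IOException {
        synchronized (cacheLock) {
            // Time spent waiting to acquire cacheLock is reported under "Monitor enter".
        }
        // Time spent waiting on the disk read is reported under "File I/O".
        byte[] raw = Files.readAllBytes(Paths.get("/tmp/reports/" + reportId));
        // Pure computation such as parsing is reported under "CPU".
        return parse(raw);
    }

    private byte[] parse(byte[] raw) {
        // CPU-bound transformation of the report contents
        return raw;
    }
}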

Viewing a profile from a trace

For each type from the breakdown, click View In Full Page to see the same data opened in a new page. From there, you can change the visualization to a flame graph. Click the Focus On selector to define the scope of the data:

  • Span & Children scopes the profiling data to the selected span and all descendant spans in the same service.
  • Span only scopes the profiling data to the previously selected span.
  • Span time period scopes the profiling data to all threads during the time period the span was active.
  • Full profile scopes the data to 60 seconds of the whole service process that executed the previously selected span.

Break down code performance by API endpoints

Prerequisites

Endpoint profiling is enabled by default when you turn on profiling for your Python service.

Requires dd-trace-py version 0.54.0+.

Endpoint profiling is enabled by default when you turn on profiling for your Go service.

Requires dd-trace-go version 1.37.0+.

Note: This feature works best with Go version 1.18 or newer. Go 1.17 and below have several bugs (see GH-35057, GH-48577, CL-369741, and CL-369983) that can reduce the accuracy of this feature, especially when using a lot of CGO.

Endpoint profiling is enabled by default when you turn on profiling for your Ruby service.

Requires dd-trace-rb version 0.54.0+.

Endpoint profiling is enabled by default when you turn on profiling for your .NET service.

Requires dd-trace-dotnet version 2.15.0+.

Endpoint profiling is enabled by default when you turn on profiling for your PHP service.

Requires dd-trace-php version 0.79.0+.

Scope flame graphs by endpoints

Endpoint profiling allows you to scope your flame graphs by any endpoint of your web service to find endpoints that are slow, latency-heavy, and causing poor end-user experience. Such endpoints can be tricky to debug because it is not always obvious why they are slow; the slowness is often caused by unintended resource consumption, such as the endpoint consuming a large share of CPU cycles.

With endpoint profiling you can:

  • Identify the bottleneck methods that are slowing down your endpoint’s overall response time.
  • Isolate the top endpoints responsible for the consumption of valuable resources such as CPU and wall time. This is particularly helpful when you are generally trying to optimize your service for performance gains.
  • Understand whether third-party code or runtime libraries are the reason for your endpoints being slow or resource-intensive.

Track the endpoints that consume the most resources

It is valuable to track the top endpoints that consume important resources such as CPU and wall time. The list can help you identify whether your endpoints have regressed, or whether newly introduced endpoints are consuming drastically more resources and slowing down your overall service.

Further reading