Investigate Slow Traces or Endpoints
If your application is showing performance problems in production, combining distributed tracing with the stack trace data collected by profiling is a powerful way to identify performance bottlenecks. Application processes that have both APM distributed tracing and the continuous profiler enabled are automatically linked.
You can move directly from span information to profiling data on the Code Hotspots tab, and find the specific lines of code related to performance issues. You can also debug slow and resource-consuming endpoints directly in the Profiling UI.
Identify code hotspots in slow traces
Prerequisites
Code Hotspots identification is enabled by default when you turn on profiling for your Java service. For manually instrumented code, the continuous profiler requires scope activation of spans:
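import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;

final Tracer tracer = GlobalTracer.get(); // assumption: the OpenTracing API with a registered tracer
final Span span = tracer.buildSpan("ServicehandlerSpan").start();
try (final Scope scope = tracer.activateSpan(span)) { // mandatory for Datadog continuous profiler to link with span
    // worker thread impl
} finally {
    // Finish the span when work is complete
    span.finish();
}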
Requires:
- OpenJDK 11+ and dd-trace-java version 0.65.0+; or
- OpenJDK 8: 8u282+ and dd-trace-java version 0.77.0+.
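If you want a service to report at startup whether its JVM meets the runtime half of these requirements, a check along the following lines is one option. This is a minimal sketch, not part of dd-trace-java: the class `CodeHotspotsJdkCheck` and its method are hypothetical names, it only inspects the standard `java.version` system property, and it does not verify the tracer version.

```java
// Hypothetical helper, not part of dd-trace-java. It checks only the JDK half
// of the requirements: OpenJDK 11+, or OpenJDK 8 at update 282 or later.
final class CodeHotspotsJdkCheck {

    /** Returns true if a java.version string such as "1.8.0_282" or "11.0.2" qualifies. */
    static boolean isSupportedJdk(String javaVersion) {
        // Drop suffixes such as "-ea" or "+9", then split the numeric components.
        String[] parts = javaVersion.split("[-+]")[0].split("[._]");
        try {
            int major = Integer.parseInt(parts[0]);
            if (major == 1 && parts.length > 1) {
                // Legacy scheme used by Java 8 and earlier: 1.<major>.0_<update>
                int legacyMajor = Integer.parseInt(parts[1]);
                int update = parts.length > 3 ? Integer.parseInt(parts[3]) : 0;
                return legacyMajor == 8 && update >= 282;
            }
            return major >= 11;
        } catch (NumberFormatException e) {
            return false; // unrecognized format: treat as unsupported
        }
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println(version + " supported: " + isSupportedJdk(version));
    }
}
```

The legacy branch is there because Java 8 reports its version as `1.8.0_<update>`, while later releases lead with the major version.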
Link from a span to profiling data
From the view of each trace, the Code Hotspots tab highlights profiling data scoped on the selected spans.
The values on the left side show the time spent in each method call during the selected span. Depending on the runtime and language, the list of available types varies:
- Method durations shows the overall time taken by each method from your code.
- CPU shows the time taken executing CPU tasks.
- Synchronization shows the time spent waiting on monitors, the time a thread is sleeping and the time it is parked.
- VM operations (Java only) shows the time taken waiting for VM operations that are not related to garbage collection (for example, heap dumps).
- File I/O shows the time taken waiting for a disk read/write operation to execute.
- Socket I/O shows the time taken waiting for a network read/write operation to execute.
- Monitor enter shows the time a thread is blocked on a lock.
- Uncategorized shows the time taken to execute the span that cannot be placed into one of the above categories.
Click the plus icon (+) to expand the stack trace to that method in reverse order. Hover over the value to see the percentage of time explained by category.
Viewing a profile from a trace
For each type from the breakdown, click View In Full Page to see the same data opened in a new page. From there, you can change the visualization to the flame graph.
Click the Focus On selector to define the scope of the data:
- Span & Children scopes the profiling data to the selected span and all descendant spans in the same service.
- Span only scopes the profiling data to the previously selected span.
- Span time period scopes the profiling data to all threads during the time period the span was active.
- Full profile scopes the data to 60 seconds of the whole service process that executed the previously selected span.
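To make the difference between the four Focus On scopes concrete, the following sketch filters a toy list of profiler samples for each scope. It is an illustrative model under stated assumptions, not how the Datadog backend works: every type and method name (`FocusOnSketch`, `SpanInfo`, `Sample`, `scope`) is hypothetical, each sample is assumed to carry the ID of the span that was active when it was taken (0 for none), and the "same service" restriction on descendant spans is ignored for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only: every type here is hypothetical, not a Datadog API.
final class FocusOnSketch {

    enum Focus { SPAN_AND_CHILDREN, SPAN_ONLY, SPAN_TIME_PERIOD, FULL_PROFILE }

    /** A span: its ID, parent ID (0 for a root span), and active time window. */
    static final class SpanInfo {
        final long id, parentId, startMs, endMs;
        SpanInfo(long id, long parentId, long startMs, long endMs) {
            this.id = id; this.parentId = parentId; this.startMs = startMs; this.endMs = endMs;
        }
    }

    /** A profiler sample: when it was taken and which span was active (0 for none). */
    static final class Sample {
        final long timeMs, spanId;
        Sample(long timeMs, long spanId) { this.timeMs = timeMs; this.spanId = spanId; }
    }

    /** Walks parent links upward to test whether spanId is the selected span or below it. */
    static boolean isSelfOrDescendant(long spanId, long selectedId, Map<Long, SpanInfo> spans) {
        long id = spanId;
        while (id != 0) {
            if (id == selectedId) return true;
            SpanInfo span = spans.get(id);
            id = (span == null) ? 0 : span.parentId;
        }
        return false;
    }

    /** Returns the samples of one profile that fall inside the chosen focus. */
    static List<Sample> scope(Focus focus, SpanInfo selected,
                              Map<Long, SpanInfo> spans, List<Sample> profile) {
        List<Sample> kept = new ArrayList<>();
        for (Sample s : profile) {
            boolean keep;
            switch (focus) {
                case SPAN_ONLY:
                    keep = s.spanId == selected.id;
                    break;
                case SPAN_AND_CHILDREN:
                    keep = isSelfOrDescendant(s.spanId, selected.id, spans);
                    break;
                case SPAN_TIME_PERIOD: // any thread, as long as the span was active
                    keep = s.timeMs >= selected.startMs && s.timeMs <= selected.endMs;
                    break;
                default: // FULL_PROFILE: everything in the profile
                    keep = true;
            }
            if (keep) kept.add(s);
        }
        return kept;
    }

    /** Made-up span tree: 1 is the root, 2 is a child of 1, 3 is a child of 2. */
    static Map<Long, SpanInfo> demoSpans() {
        Map<Long, SpanInfo> spans = new HashMap<>();
        spans.put(1L, new SpanInfo(1, 0, 0, 100));
        spans.put(2L, new SpanInfo(2, 1, 20, 60));
        spans.put(3L, new SpanInfo(3, 2, 30, 50));
        return spans;
    }

    /** Made-up samples; spanId 0 stands for a thread with no active span. */
    static List<Sample> demoProfile() {
        return Arrays.asList(
            new Sample(10, 1), new Sample(25, 2), new Sample(35, 3),
            new Sample(40, 0), new Sample(70, 1), new Sample(150, 0));
    }

    public static void main(String[] args) {
        Map<Long, SpanInfo> spans = demoSpans();
        for (Focus f : Focus.values()) {
            int count = scope(f, spans.get(2L), spans, demoProfile()).size();
            System.out.println(f + ": " + count + " samples");
        }
    }
}
```

With span 2 selected, each scope in turn keeps at least as many samples as the narrower one before it, which mirrors how the options widen from Span only to Full profile.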
Scope flame graphs by endpoints
Endpoint profiling allows you to scope your flame graphs by any endpoint of your web service to find endpoints that are slow, latency-heavy, and causing poor end-user experience. It can be hard to work out why such endpoints are slow. The slowness might be caused by unintended heavy resource consumption, such as the endpoint consuming many CPU cycles.
With endpoint profiling you can:
- Identify the bottleneck methods that are slowing down your endpoint’s overall response time.
- Isolate the top endpoints responsible for the consumption of valuable resources such as CPU and wall time. This is particularly helpful when you are optimizing your service for performance.
- Understand if third-party code or runtime libraries are the reason your endpoints are slow or resource-heavy.
Track the endpoints that consume the most resources
It is valuable to track the top endpoints that consume resources such as CPU and wall time. The list can help you identify whether existing endpoints have regressed or whether newly introduced endpoints are consuming drastically more resources and slowing down your overall service.
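The idea of ranking endpoints by consumption and spotting regressions can be sketched in a few lines. The following is a minimal illustration, not the Profiling UI's implementation: the class `TopEndpointsSketch`, its methods, the endpoint names, and the CPU figures are all hypothetical, and a "regression" is assumed to mean that an endpoint's total CPU time grew by more than a chosen factor between two periods.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: endpoint names and numbers are made up; not a Datadog API.
final class TopEndpointsSketch {

    /** CPU time attributed to one endpoint by one profiler sample. */
    static final class Sample {
        final String endpoint;
        final long cpuMs;
        Sample(String endpoint, long cpuMs) { this.endpoint = endpoint; this.cpuMs = cpuMs; }
    }

    /** Sums CPU time per endpoint. */
    static Map<String, Long> cpuByEndpoint(List<Sample> samples) {
        Map<String, Long> totals = new HashMap<>();
        for (Sample s : samples) {
            totals.merge(s.endpoint, s.cpuMs, Long::sum);
        }
        return totals;
    }

    /** Returns up to n endpoint names, highest total first (ties broken by name). */
    static List<String> top(Map<String, Long> totals, int n) {
        List<String> names = new ArrayList<>(totals.keySet());
        names.sort((a, b) -> {
            int byCpu = Long.compare(totals.get(b), totals.get(a));
            return byCpu != 0 ? byCpu : a.compareTo(b);
        });
        return new ArrayList<>(names.subList(0, Math.min(n, names.size())));
    }

    /** Flags endpoints that are new, or whose total grew by more than the given factor. */
    static List<String> newOrRegressed(Map<String, Long> before, Map<String, Long> after,
                                       double factor) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            Long old = before.get(e.getKey());
            if (old == null || e.getValue() > old * factor) {
                flagged.add(e.getKey());
            }
        }
        Collections.sort(flagged); // stable, name-ordered output
        return flagged;
    }

    public static void main(String[] args) {
        List<Sample> lastWeek = new ArrayList<>();
        lastWeek.add(new Sample("GET /users", 100));
        lastWeek.add(new Sample("GET /users", 50));
        lastWeek.add(new Sample("POST /orders", 400));

        List<Sample> thisWeek = new ArrayList<>();
        thisWeek.add(new Sample("GET /users", 500));
        thisWeek.add(new Sample("POST /orders", 420));
        thisWeek.add(new Sample("GET /reports", 900));

        Map<String, Long> before = cpuByEndpoint(lastWeek);
        Map<String, Long> after = cpuByEndpoint(thisWeek);
        System.out.println("Top 2 by CPU: " + top(after, 2));
        System.out.println("New or regressed (>2x): " + newOrRegressed(before, after, 2.0));
    }
}
```

The same aggregation works for wall time by swapping the measured value. The 2x threshold is arbitrary and would need tuning for a real service.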