APM Troubleshooting

If you experience unexpected behavior with Datadog APM, this guide covers a few common issues you can investigate to help resolve the problem quickly. Reach out to Datadog support for further assistance.

Confirm APM setup and Agent status

During startup, all Datadog tracing libraries past the versions listed below emit logs that show the applied configuration as a JSON object, along with any errors encountered, including whether the Agent can be reached in languages where this is possible. If your tracer version includes these startup logs, start your troubleshooting there.
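
As a rough sketch of how you might pull startup log lines out of an application log, the helper below filters a file for them. The function name `find_startup_logs` is hypothetical, and the `DATADOG TRACER CONFIGURATION` and `DATADOG TRACER DIAGNOSTICS` prefixes are an assumption about how your tracer labels these lines; the exact wording can differ by language and tracer version, so adjust the pattern to match what your application actually prints.

```shell
# Print tracer startup log lines from an application log file.
# $1: path to the application log (wherever your app writes its logs).
# Assumption: startup lines carry a "DATADOG TRACER CONFIGURATION" or
# "DATADOG TRACER DIAGNOSTICS" prefix; adjust the pattern if yours differ.
find_startup_logs() {
  grep -iE 'DATADOG TRACER (CONFIGURATION|DIAGNOSTICS)' "$1"
}

# Example (hypothetical log path):
# find_startup_logs /var/log/my-app/app.log
```

If the function prints nothing, either the tracer version predates startup logs or the log file does not cover application startup.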


Tracer debug logs

To capture full details on the Datadog tracer, enable debug mode on your tracer by using the DD_TRACE_DEBUG environment variable. You might enable it for your own investigation or because Datadog support recommended it for triage purposes. However, don’t leave debug mode enabled permanently, because it adds logging overhead.

These logs can surface instrumentation errors or integration-specific errors. For details on enabling and capturing these debug logs, see the debug mode troubleshooting page.
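
One way to avoid leaving debug mode on is to scope DD_TRACE_DEBUG to a single run of the application instead of exporting it for the whole session. The sketch below assumes a POSIX shell; the wrapper name `run_with_tracer_debug` and the start command in the usage comment are hypothetical placeholders.

```shell
# Run one command with tracer debug logging enabled.
# The variable is set only in that command's environment, so debug mode
# does not persist in the calling shell after the command exits.
run_with_tracer_debug() {
  env DD_TRACE_DEBUG=true "$@"
}

# Example (hypothetical start command for your application):
# run_with_tracer_debug ./start-my-app
```

In containerized deployments the same variable would instead be set in the container's environment and removed again after the investigation.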

APM rate limits

If the Datadog Agent logs contain error messages about rate limits or max events per second, you can change these limits by following these instructions. If you have questions, consult our support team before you change the limits.
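
To check whether such messages are present, you could filter the Agent's trace log for the phrases mentioned above. This is a minimal sketch: the function name `find_rate_limit_messages` is hypothetical, and the log path in the usage comment assumes a default Linux install, so substitute the path used in your environment.

```shell
# Print Agent log lines that mention rate limits or max events per second.
# $1: path to a trace-agent log file.
find_rate_limit_messages() {
  grep -iE 'rate limit|max events per second' "$1"
}

# Example (path is an assumption for a default Linux install):
# find_rate_limit_messages /var/log/datadog/trace-agent.log
```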

Modifying, discarding, or obfuscating spans

Several configuration options are available to scrub sensitive data or to discard traces corresponding to health checks or other unwanted traffic. These can be configured within the Datadog Agent or, in some languages, in the Tracing Client. For details on the options available, see the Security and Agent Customization page of the documentation. That page offers representative examples; if you need assistance applying these options to your environment, reach out to Datadog Support and provide details of your desired outcome.

Troubleshooting data requested by Datadog Support

When you open a support ticket, our support team may ask for some combination of the following types of information:

  1. How are you confirming the issue? Provide links to a trace (preferably) or screenshots, for example, and tell us what you expect to see.

    This allows us to confirm errors and attempt to reproduce your issues within our testing environments.

  2. Tracer Startup Logs

    Startup logs are a great way to spot a misconfigured tracer, or a tracer that cannot communicate with the Datadog Agent. By comparing the configuration that the tracer sees to the one set within the application or container, we can identify areas where a setting is not being properly applied.

  3. Tracer Debug Logs

    Tracer debug logs go one step deeper than startup logs and help us identify whether integrations are instrumenting properly, which we can't necessarily check until traffic flows through the application. Debug logs can be extremely useful for viewing the contents of spans created by the tracer, and they can surface an error if there is a connection issue when attempting to send spans to the Agent. Tracer debug logs are typically the most informative and reliable tool for confirming nuanced behavior of the tracer.

  4. An Agent flare (a snapshot of logs and configs) that captures a representative log sample from a period when traces are being sent to your Agent, with the Agent in debug or trace mode depending on what information we are looking for in these logs.

    Agent flares allow us to see what is happening within the Datadog Agent, for example whether traces are being rejected or are malformed within the Agent. A flare will not help if traces are not reaching the Agent, but it does help us identify the source of an issue or any metric discrepancies.

    When adjusting the log level to debug or trace mode, please take into consideration that these will significantly increase log volume and therefore consumption of system resources (namely storage space over the long term). We recommend these only be used temporarily for troubleshooting purposes and the level be restored to info afterward.

    Note: If you are using Agent v7.19+ with the latest version of the Datadog Helm Chart, or a DaemonSet where the Datadog Agent and trace-agent run in separate containers, run the following command with log_level: DEBUG or log_level: TRACE set in your datadog.yaml to get a flare from the trace-agent:

    kubectl exec -it <agent-pod-name> -c trace-agent -- agent flare <case-id> --local

  5. A description of your environment

    Knowing how your application is deployed helps us identify likely causes of tracer-agent communication problems or misconfigurations. For difficult issues, we may ask to see a Kubernetes manifest or an ECS task definition, for example.

  6. Custom code written using the tracing libraries, such as tracer configuration, custom instrumentation, and adding span tags

    Custom instrumentation can be a very powerful tool, but it can also have unintended side effects on your trace visualizations within Datadog, so we ask about it to rule it out as a suspect. Additionally, asking for your automatic instrumentation and configuration allows us to confirm whether this matches what we are seeing in both tracer startup and debug logs.

  7. Versions of languages, frameworks, the Datadog Agent, and Tracing Library being used

    Knowing which versions are in use allows us to confirm that integrations are supported according to our Compatibility Requirements section, check for known issues, or recommend a tracer or language version upgrade if it will address the problem.
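
To gather the version information in item 7 before opening a ticket, a small helper along these lines may be useful. It is only a sketch: `report_version` is a hypothetical name, the commands listed are examples for a Linux host with the Agent installed locally, and you would swap in the runtimes and tools your application actually uses. The tracing library version usually comes from your language's package manager and is not covered here.

```shell
# Print "<label>: <first line of version output>" when the command exists,
# or "<label>: not installed" when it does not.
# $1: label to print; remaining arguments: the version command to run.
report_version() {
  label=$1
  shift
  if command -v "$1" >/dev/null 2>&1; then
    # Some tools (java, for example) print their version on stderr.
    printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
  else
    printf '%s: not installed\n' "$label"
  fi
}

# Example commands; replace with the ones relevant to your stack.
report_version "Datadog Agent" datadog-agent version
report_version "Python" python3 --version
report_version "Java" java -version
```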
