APM Terms and Concepts

The APM UI provides many tools to troubleshoot application performance and correlate it throughout the product, which helps you find and resolve issues in distributed systems.

| Concept | Description |
|---|---|
| Service | Services are the building blocks of modern microservice architectures - broadly a service groups together endpoints, queries, or jobs for the purposes of building your application. |
| Resource | Resources represent a particular domain of a customer application - they are typically an instrumented web endpoint, database query, or background job. |
| Monitors | APM metric monitors work like regular metric monitors, but with controls tailored specifically to APM. Use these monitors to receive alerts at the service level on hits, errors, and a variety of latency measures. |
| Trace | A trace is used to track the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans. |
| Span | A span represents a logical unit of work in a distributed system for a given time period. Multiple spans construct a trace. |
| Service entry span | A span is a service entry span when it is the entrypoint method for a request to a service. On a flame graph in Datadog APM, you can recognize it because its immediate parent is drawn in a different color. |
| Trace root span | A span is the root span when it is the entrypoint method for the trace. Its start marks the beginning of the trace. |
| Trace metrics | Trace metrics are automatically collected and kept with a 15-month retention policy similar to other Datadog metrics. They can be used to identify and alert on hits, errors, or latency. Statistics and metrics are always calculated based on all traces, and are not impacted by ingestion controls. |
| Indexed Span | Indexed Spans represent all spans indexed by retention filters or legacy App Analytics analyzed spans and can be used to search, query, and monitor in Analytics. |
| Span tags | Tag spans in the form of key-value pairs to correlate a request in the Trace View or filter in Analytics. |
| Retention Filters | Retention filters are tag-based controls set within the Datadog UI that determine what spans to index in Datadog for 15 days. |
| Ingestion Controls | Ingestion controls are used to send up to 100% of traces to Datadog for live search and analytics for 15 minutes. |
| Sublayer Metric | A sublayer metric is the execution duration of a given type / service within a trace. |
| Execution Time | Total time that a span is considered ‘active’ (not waiting for a child span to complete). |

Services

After instrumenting your application, the Services List is your main landing page for APM data.

service list

Services are the building blocks of modern microservice architectures - broadly a service groups together endpoints, queries, or jobs for the purposes of scaling instances. Some examples:

  • A group of URL endpoints may be grouped together under an API service.
  • A group of DB queries may be grouped together within one database service.
  • A group of periodic jobs may be configured in the crond service.

The screenshot below shows a distributed microservice system for an e-commerce site builder. A web-store, ad-server, payment-db, and auth-service are all represented as services in APM.

service map

All services can be found in the Service List and visually represented on the Service Map. Each service has its own Service page where trace metrics like throughput, latency, and error rates can be viewed and inspected. Use these metrics to create dashboard widgets, create monitors, and see the performance of every resource such as a web endpoint or database query belonging to the service.

Don’t see the HTTP endpoints you were expecting on the Service page? In APM, endpoints are connected to a service not only by the service name but also by the `span.name` of the entry-point span of the trace. For example, on the web-store service above, `web.request` is the entry-point span. More info on this here.

Resources

Resources represent a particular domain of a customer application. They are typically an instrumented web endpoint, database query, or background job. For a web service, these resources can be dynamic web endpoints that are grouped by a static span name, web.request. In a database service, these would be database queries with the span name db.query. For example, the web-store service has automatically instrumented resources (web endpoints) that handle checkouts, updating_carts, add_item, etc.

Each resource has its own Resource page with trace metrics scoped to the specific endpoint. Trace metrics can be used like any other Datadog metric: they are exportable to a dashboard or can be used to create monitors. The Resource page also shows the span summary widget with an aggregate view of spans for all traces, the latency distribution of requests, and traces that show requests made to this endpoint.

Trace

A trace is used to track the time spent by an application processing a request and the status of this request. Each trace consists of one or more spans. During the lifetime of the request, you can see distributed calls across services (because a trace-id is injected/extracted through HTTP headers), automatically instrumented libraries, and manual instrumentation using open-source tools like OpenTracing in the flame graph view. In the Trace View page, each trace collects information that connects it to other parts of the platform, including connecting logs to traces, adding tags to spans, and collecting runtime metrics.

trace view
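As a rough illustration of how a trace-id can travel between services through HTTP headers, the sketch below injects a trace context into outgoing headers and extracts it on the receiving side. The header names `x-datadog-trace-id` and `x-datadog-parent-id` are the ones Datadog tracers use; the function names and the dictionary-based span are hypothetical simplifications, not the tracer's actual API.

```python
import random

# Header names used for Datadog trace context propagation.
TRACE_ID_HEADER = "x-datadog-trace-id"
PARENT_ID_HEADER = "x-datadog-parent-id"


def inject(trace_id, span_id, headers):
    """Return a copy of the outgoing headers carrying the trace context."""
    out = dict(headers)
    out[TRACE_ID_HEADER] = str(trace_id)
    out[PARENT_ID_HEADER] = str(span_id)
    return out


def extract(headers):
    """Return (trace_id, parent_id) from incoming headers, or None if absent or malformed."""
    lowered = {key.lower(): value for key, value in headers.items()}
    try:
        return int(lowered[TRACE_ID_HEADER]), int(lowered[PARENT_ID_HEADER])
    except (KeyError, ValueError):
        return None


def start_span(incoming_headers):
    """Start a span (hypothetical dict model) that joins the caller's trace if one exists."""
    context = extract(incoming_headers)
    span_id = random.getrandbits(63)
    if context is None:
        # No upstream context: this span begins a new trace and is its root span.
        return {"trace_id": random.getrandbits(63), "span_id": span_id, "parent_id": None}
    trace_id, parent_id = context
    return {"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
```

Because the downstream service reuses the extracted trace-id and records the caller's span as its parent, spans created on different machines end up in the same trace.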

Spans

A span represents a logical unit of work in the system for a given time period. Each span consists of a span.name, start time, duration, and span tags. For example, a span can describe the time spent on a distributed call on a separate machine, or the time spent in a small component within a larger request. Spans can be nested within each other, which creates a parent-child relationship between the spans.

For the example below, the span rack.request is the entry-point span of the trace. This means the web-store service page is displaying resources that consist of traces with an entry-point span named rack.request. The example also shows the tags added on the application side (merchant.store_name, merchant.tier, etc.). These user-defined tags can be used to search and analyze APM data in Analytics.

span

Service entry span

A span is a service entry span when it is the entrypoint method for a request to a service. On a flame graph in Datadog APM, you can recognize a service entry span because its immediate parent is drawn in a different color. Services are also listed on the right when viewing a flame graph.

Trace root span

A span is a trace root span when it is the first span of a trace. The root span is the entry-point method of the traced request. Its start marks the beginning of the trace.

For the example below, the service entry spans are:

  • rack.request (which is also the root span)
  • aspnet_coremvc.request
  • The topmost green span below aspnet_coremvc.request
  • Every orange mongodb span

span
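The two definitions above can be sketched in a few lines. This is a minimal model that assumes flame graph colors correspond to services, so that "the immediate parent has a different color" becomes "the parent belongs to a different service". The `Span` class, the service name `shipping-api`, and the span name `render` are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]  # None for the trace root span
    service: str
    name: str


def root_span(spans):
    """Return the span with no parent, or None if the list has no root."""
    for span in spans:
        if span.parent_id is None:
            return span
    return None


def service_entry_spans(spans):
    """Return spans whose parent is missing or belongs to another service."""
    by_id = {span.span_id: span for span in spans}
    entries = []
    for span in spans:
        parent = by_id.get(span.parent_id)
        if parent is None or parent.service != span.service:
            entries.append(span)
    return entries


# Hypothetical trace loosely modeled on the example above.
trace = [
    Span(1, None, "web-store", "rack.request"),
    Span(2, 1, "web-store", "render"),
    Span(3, 1, "shipping-api", "aspnet_coremvc.request"),
    Span(4, 3, "mongodb", "mongodb.query"),
    Span(5, 3, "mongodb", "mongodb.query"),
]
entry_ids = [span.span_id for span in service_entry_spans(trace)]  # [1, 3, 4, 5]
```

Span 2 is not a service entry span because its parent belongs to the same service, while the root span always qualifies because it has no parent at all.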

Span summary

The span summary table shows metrics for spans aggregated across all traces, including how often the span shows up among all traces, what percent of traces contain the span, the average duration for the span, and its typical share of total execution time of the requests. This helps you detect N+1 problems in your code so you can improve your application performance.

The span summary table contains the following columns:

Average spans per trace
Average number of occurrences of the span for traces, including the current resource, where the span is present at least once.
Percentage of traces
Percentage of traces, including the current resource, where the span is present at least once.
Average duration
Average duration of the span for traces, including the current resource, where the span is present at least once.
Average percentage of execution time
Average ratio of execution time for which the span was active for traces, including the current resource, where the span is present at least once.

Span summary table
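To make the column definitions concrete, here is a minimal sketch that computes the first three columns from a list of traces. It assumes each trace is a list of `(span_name, duration_ms)` pairs and that the average duration is taken over every occurrence of the span; both are simplifications chosen for this example, and the fourth column is left out because it needs per-span execution times.

```python
from collections import defaultdict


def span_summary(traces):
    """Aggregate per-span statistics across traces.

    traces: list of traces, each a list of (span_name, duration_ms) pairs.
    """
    occurrences = defaultdict(int)       # total times each span name appears
    containing = defaultdict(int)        # traces containing the span at least once
    total_duration = defaultdict(float)  # summed duration of all occurrences

    for trace in traces:
        seen = set()
        for name, duration in trace:
            occurrences[name] += 1
            total_duration[name] += duration
            seen.add(name)
        for name in seen:
            containing[name] += 1

    return {
        name: {
            "avg_spans_per_trace": occurrences[name] / containing[name],
            "pct_of_traces": 100.0 * containing[name] / len(traces),
            "avg_duration_ms": total_duration[name] / occurrences[name],
        }
        for name in occurrences
    }


traces = [
    [("web.request", 120.0), ("db.query", 5.0), ("db.query", 7.0), ("db.query", 6.0)],
    [("web.request", 80.0)],
]
summary = span_summary(traces)
# summary["db.query"]["avg_spans_per_trace"] is 3.0: a high value like this,
# repeated across many traces, is the kind of signal that hints at an N+1 query.
```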

Trace metrics

Trace metrics are automatically collected and kept with a 15-month retention policy, similar to any other Datadog metric. They can be used to identify and alert on hits, errors, or latency. Trace metrics are tagged by the host receiving traces along with the service or resource. For example, after instrumenting a web service, trace metrics are collected for the entry-point span web.request in the Metric Summary.
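The sketch below shows one way hits, errors, and latency could be rolled up from entry-point spans and tagged by service and resource. It is an illustration only: the dictionary-based span format is hypothetical, and the metric names assume the `trace.<span.name>.<suffix>` pattern (for example `trace.web.request.hits`), which this page does not spell out.

```python
from collections import defaultdict


def trace_metrics(entry_spans):
    """Aggregate hits, errors, and total duration per (metric name, tags).

    entry_spans: iterable of dicts with the keys
    name, service, resource, duration_ms, and error (hypothetical format).
    """
    metrics = defaultdict(float)
    for span in entry_spans:
        tags = (("resource", span["resource"]), ("service", span["service"]))
        prefix = "trace." + span["name"]
        metrics[(prefix + ".hits", tags)] += 1
        metrics[(prefix + ".duration", tags)] += span["duration_ms"]
        if span["error"]:
            metrics[(prefix + ".errors", tags)] += 1
    return dict(metrics)


spans = [
    {"name": "web.request", "service": "web-store", "resource": "checkouts",
     "duration_ms": 120, "error": False},
    {"name": "web.request", "service": "web-store", "resource": "checkouts",
     "duration_ms": 80, "error": True},
]
result = trace_metrics(spans)
checkout_tags = (("resource", "checkouts"), ("service", "web-store"))
# result[("trace.web.request.hits", checkout_tags)] is 2.0
```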

Dashboard

Trace metrics can be exported to a dashboard from the Service or Resource page. Additionally, trace metrics can be queried from an existing dashboard.

Monitoring

Trace metrics are useful for monitoring. APM monitors can be set up on the New Monitors, Service, or Resource page. A set of suggested monitors is available on the Service or Resource page.

Trace Explorer

Explore and perform analytics on 100% of ingested traces for 15 minutes and all indexed spans for 15 days.

Indexed span

Indexed Spans are spans that a retention filter has indexed. They are stored in Datadog for 15 days and can be searched, queried, and monitored in Trace Search and Analytics using the tags included on the span.

Creating tag-based retention filters after ingestion allows you to control and visualize exactly how many spans are being indexed per service.

Span tags

Tag spans in the form of key-value pairs to correlate a request in the Trace View or filter in Analytics. Tags can be added to a single span or globally to all spans. For the example below, the requests (merchant.store_name, merchant.tier, etc.) have been added as tags to the span.

span tag

To get started tagging spans in your application, check out this walkthrough.

After a tag has been added to a span, search and query on the tag in Analytics by clicking on the tag to add it as a facet. Once this is done, the value of this tag is stored for all new traces and can be used in the search bar, facet panel, and trace graph query.

Create Facet

Retention filters

Set tag-based filters in the Datadog UI to index spans for 15 days for use with Trace Search and Analytics.
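Conceptually, a tag-based filter is a predicate over span tags. The toy model below treats a filter as a set of required key-value pairs and keeps any span that satisfies at least one filter. It does not reflect the actual filter query syntax or any sampling behavior in the Datadog UI; the function names and the tag values `enterprise` and `basic` are made up for the example.

```python
def matches(span_tags, filter_tags):
    """True when the span carries every key-value pair the filter requires."""
    return all(span_tags.get(key) == value for key, value in filter_tags.items())


def spans_to_index(spans, filters):
    """Keep the spans that match at least one retention filter."""
    return [span for span in spans if any(matches(span["tags"], f) for f in filters)]


# Hypothetical filter: index enterprise-tier requests to the web-store service.
filters = [{"service": "web-store", "merchant.tier": "enterprise"}]
spans = [
    {"name": "web.request", "tags": {"service": "web-store", "merchant.tier": "enterprise"}},
    {"name": "web.request", "tags": {"service": "web-store", "merchant.tier": "basic"}},
]
indexed = spans_to_index(spans, filters)  # only the first span is kept
```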

Ingestion controls

Send 100% of traces from your services to Datadog and combine with tag-based retention filters to keep traces that matter for your business for 15 days.

Sublayer metric

Some Tracing Application Metrics are tagged with sublayer_service and sublayer_type so that you can see the execution time for individual services within a trace.

Execution time

Execution time is calculated by adding up the time that a span is active, meaning it is not waiting for a child span to complete. For non-concurrent work, this is straightforward. In the following image, the execution time for Span 1 is $D_1 + D_2 + D_3$. The execution times for Spans 2 and 3 are their respective widths.

Execution time

When child spans are concurrent, execution time is calculated by dividing the overlapping time by the number of concurrently active spans. In the following image, Spans 2 and 3 are concurrent (both are children of Span 1), overlapping for the duration of Span 3, so the execution time of Span 2 is $D_2 \div 2 + D_3$, and the execution time of Span 3 is $D_2 \div 2$.

Execution time for concurrent work
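The rules for sequential and concurrent work can be expressed as a single sweep over time. The sketch below is one possible reading of those rules, not Datadog's implementation: it cuts the trace into slices at every span boundary, treats a span as active in a slice when it is running and none of its children are, and splits the slice evenly among the active spans. The `Span` class and the timings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: int
    parent_id: Optional[int]
    start: float
    end: float


def execution_times(spans):
    """Return {span_id: execution time} for the spans of one trace."""
    # Every start or end is a boundary; between two boundaries nothing changes.
    boundaries = sorted({t for span in spans for t in (span.start, span.end)})
    result = {span.span_id: 0.0 for span in spans}

    for lo, hi in zip(boundaries, boundaries[1:]):
        running = [s for s in spans if s.start <= lo and s.end >= hi]
        # A running span with a running child is waiting, not active.
        waiting_ids = {s.parent_id for s in running}
        active = [s for s in running if s.span_id not in waiting_ids]
        for span in active:
            # Concurrently active spans share the slice equally.
            result[span.span_id] += (hi - lo) / len(active)
    return result


# Span 1 runs from 0 to 10; its children, Spans 2 and 3, overlap from 2 to 5.
spans = [
    Span(1, None, 0, 10),
    Span(2, 1, 2, 8),
    Span(3, 1, 2, 5),
]
times = execution_times(spans)  # {1: 4.0, 2: 4.5, 3: 1.5}
```

Here the overlap lasts 3 units and Span 2 then runs alone for 3 more, so Span 2 receives 3 ÷ 2 + 3 = 4.5 and Span 3 receives 3 ÷ 2 = 1.5, matching the concurrent case described above. The execution times also sum to the full 10 units of the trace.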

Further Reading