What are metrics?
Metrics are numerical values that can track anything about your environment over time, from latency to error rates to user signups.
In Datadog, metric data is ingested and stored as data points, each with a value and a timestamp. A sequence of data points is stored as a timeseries:
[ 17.82, 22:11:01 ]
[ 6.38, 22:11:12 ]
[ 2.87, 22:11:38 ]
[ 7.06, 22:12:00 ]
Any metric timestamps that include fractions of a second are rounded to the nearest second. If multiple points share the same timestamp, the latest point overwrites the previous ones.
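The rounding and overwrite behavior described above can be illustrated with a minimal Python sketch. It is not Datadog's implementation: the function names, the sample date, and the choice to round half-seconds up are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def round_to_second(ts: datetime) -> datetime:
    """Round a timestamp to the nearest whole second (half-seconds round up: an assumption)."""
    if ts.microsecond >= 500_000:
        ts += timedelta(seconds=1)
    return ts.replace(microsecond=0)

def store_points(points):
    """Build a timeseries keyed by rounded timestamp; a later point overwrites an earlier one."""
    series = {}
    for ts, value in points:  # points are processed in arrival order
        series[round_to_second(ts)] = value
    return sorted(series.items())

# Hypothetical sample data: the first two points round to the same second.
points = [
    (datetime(2024, 1, 1, 22, 11, 1, 200_000), 17.82),
    (datetime(2024, 1, 1, 22, 11, 0, 900_000), 6.38),
    (datetime(2024, 1, 1, 22, 11, 38), 2.87),
]
timeseries = store_points(points)
```

In this sketch, the second point replaces the first because both round to 22:11:01, so the stored timeseries contains two points rather than three.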
Why are metrics useful?
Metrics provide an overall picture of your system. You can use them to assess the health of your environment at a glance. Visualize how quickly users are loading your website, or the average memory consumption of your servers, for instance. Once you identify a problem, you can use logs and tracing to further troubleshoot.
Metrics that track system health come automatically through Datadog’s integrations with more than 650 services. You can also track metrics that are specific to your business, known as custom metrics. These can cover anything from the number of user logins or user cart sizes to the frequency of your team’s code commits.
In addition, metrics can help you adjust the scale of your environment to meet the demand from your customers. Knowing exactly how much you need to consume in resources can help you save money or improve performance.
Visualizing metrics in Datadog
You can visualize your metrics and create graphs throughout Datadog: in Metrics Explorer, Dashboards, or Notebooks.
Here’s an example of a timeseries visualization:
This line graph plots latency (in milliseconds) experienced by users on the y-axis against time on the x-axis.
Datadog offers a variety of visualization options to help users easily graph and display their metrics.
A metric query consists of the same two evaluation steps to start: time aggregation and space aggregation. See the anatomy of a metric query for more information.
Two visualization offerings that Metrics users often find useful are:
Additionally, Datadog has many other types of graphs and widgets for visualizations. You can learn more about them in Datadog’s blog series about metric graphs.
Submitting metrics to Datadog
Metrics can be sent to Datadog from several places.
Datadog-Supported Integrations: Datadog’s 650+ integrations include metrics out of the box. To access these metrics, navigate to the specific integration page for your service and follow the installation instructions there. If you need to monitor an EC2 instance, for example, you would go to the Amazon EC2 integration documentation.
You can generate metrics directly within the Datadog platform. For instance, you can count error status codes appearing in your logs and store that as a new metric in Datadog.
Often, you’ll need to track metrics related to your business (for example, number of user logins or signups). In these cases, you can create custom metrics. Custom metrics can be submitted through the Agent, DogStatsD, or the HTTP API.
Additionally, the Datadog Agent automatically sends several standard metrics (such as CPU and disk usage).
For a summary of all metric submission sources and methods, read the Metrics Types documentation.
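As a rough illustration of submitting a custom metric through DogStatsD, the sketch below formats metrics in the DogStatsD text format (name:value|type|#tags) and can send them over UDP. The metric names and tags are hypothetical, and sending assumes a Datadog Agent with DogStatsD listening locally on its default port, 8125. In practice, an official client library would normally handle this for you.

```python
import socket

def format_datagram(name, value, metric_type, tags=None):
    """Format one metric as a DogStatsD datagram: name:value|type|#tag1,tag2."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send a datagram over UDP (assumes a local Agent is listening on this port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))

# Hypothetical business metrics: "c" marks a count, "g" marks a gauge.
login_count = format_datagram("shop.user.logins", 1, "c", tags=["env:prod"])
cart_size = format_datagram("shop.cart.size", 3, "g", tags=["env:prod", "region:us-east-1"])
```

Calling send_metric(login_count) would then hand the datagram to the local Agent, which forwards the aggregated values to Datadog.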
The graphing experience is consistent whether you are using dashboards, notebooks, or monitors. You can create graphs by using the graphing editor UI or by directly changing the raw query string. To edit the query string, use the </> button on the far right.
Anatomy of a metric query
A metric query in Datadog looks like this:
You can break this query into a few steps:
First, choose the specific metric that you’d like to graph by searching or selecting it from the dropdown next to Metric. If you’re not sure which metric to use, start with the Metrics Explorer or a notebook. You can also see a list of actively reporting metrics on the Metrics Summary page.
Filter your metric
After selecting a metric, you can filter your query based on tags. For instance, you can use account:prod to scope your query to include only the metrics from your production hosts. For more information, read the tagging documentation.
Configure time aggregation
Next, choose the granularity of your data using time rollup. In this example, you’ve defined that there is one data point for every hour (3600 seconds). You can also choose how you want to aggregate the data in each time bucket. By default, avg is applied, but other available options are sum, min, max, and count. If you wanted to apply max, you would use
Configure space aggregation
In Datadog, “space” refers to the way metrics are distributed over different hosts and tags. There are two different aspects of space that you can control: aggregator and grouping.
Aggregator defines how the metrics in each group are combined. There are four aggregations available: sum, min, max, and avg.
Grouping defines what constitutes a line on the graph. For example, if you have hundreds of hosts spread across four regions, grouping by region allows you to graph one line for every region. This would reduce the number of timeseries to four.
Apply functions (optional)
You can modify your graph values with mathematical functions. This can mean performing arithmetic between an integer and a metric (for example, multiplying a metric by 2) or between two metrics (for example, creating a new timeseries for the memory utilization rate with jvm.heap_memory / jvm.heap_memory_max).
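To tie the steps together, here is a hedged Python sketch that assembles a raw query string from its parts: space aggregator, metric, tag scope, grouping, and time rollup. The build_query helper is hypothetical and not part of any Datadog library, and system.cpu.user is used only as an illustrative metric name. Check the query editor for the exact string your own graph produces.

```python
def build_query(metric, scope="*", space_agg="avg", group_by=None, rollup=None):
    """Compose a query string: space_agg:metric{scope} by {tags}.rollup(method, seconds)."""
    query = f"{space_agg}:{metric}{{{scope}}}"
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    if rollup:
        method, seconds = rollup
        query += f".rollup({method}, {seconds})"
    return query

# Metric filtered to production, one line per region, hourly max in each time bucket.
cpu_query = build_query(
    "system.cpu.user",
    scope="account:prod",
    group_by=["region"],
    rollup=("max", 3600),
)

# Arithmetic between two metrics: a memory utilization ratio.
heap_ratio = f"{build_query('jvm.heap_memory')} / {build_query('jvm.heap_memory_max')}"
```

Each argument maps to one step above: the metric, the filter, the time aggregation, and the space aggregation, with arithmetic applied between complete queries.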
Time and space aggregation
Time aggregation and space aggregation are two important components of any query. Because understanding how these aggregations work helps you avoid misinterpreting your graphs, these concepts are explained in more detail below.
Datadog stores a large volume of points, and in most cases it’s not possible to display all of them on a graph because there would be more data points than pixels. Datadog uses time aggregation to solve this problem by combining data points into time buckets. For example, when examining four hours, data points are combined into two-minute buckets. This is called a rollup. As the time interval you’ve defined for your query increases, the granularity of your data decreases.
There are five aggregations you can apply to combine your data in each time bucket: sum, min, max, avg, and count.
It’s important to remember that time aggregation is always applied in every query you make.
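The bucketing idea behind a rollup can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Datadog's implementation: timestamps are plain seconds, buckets are aligned to multiples of the interval, and the sample points are made up.

```python
from collections import defaultdict
from statistics import mean

# The five time aggregations named above.
AGGREGATIONS = {"sum": sum, "min": min, "max": max, "avg": mean, "count": len}

def rollup(points, interval, method="avg"):
    """Group (timestamp_seconds, value) points into fixed-width buckets and aggregate each."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % interval)  # align to the start of the bucket
        buckets[bucket_start].append(value)
    aggregate = AGGREGATIONS[method]
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]

# Hypothetical points spanning a little over three minutes.
points = [(0, 10), (30, 20), (90, 30), (130, 5), (200, 15)]
two_minute_avg = rollup(points, interval=120)
two_minute_max = rollup(points, interval=120, method="max")
```

Five raw points collapse into two buckets here. The same points give different graphs depending on the aggregation chosen, which is why it matters to know which one a query applies.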
Space aggregation splits a single metric into multiple timeseries by tags such as host, container, and region. For instance, to view the latency of your EC2 instances by region, you would use space aggregation’s group by functionality to combine each region’s hosts.
There are four aggregators that can be applied when using space aggregation: sum, min, max, and avg. Using the above example, say that your hosts are spread across four regions: us-east-1, us-east-2, us-west-1, and us-west-2. The hosts in each region need to be combined using an aggregator function. Using the max aggregator would result in the maximum latency experienced across hosts in each region, while the avg aggregator would yield the average latency per region.
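The region example can be modeled with a short Python sketch that groups per-host values by a tag and combines each group with an aggregator. The host names and latency values are invented, and the sketch handles a single timestamp only. A real query repeats this for every time bucket.

```python
from collections import defaultdict
from statistics import mean

# The four space aggregators named above.
AGGREGATORS = {"sum": sum, "min": min, "max": max, "avg": mean}

def space_aggregate(host_values, group_tag, aggregator="avg"):
    """Combine per-host values at one timestamp into one value per tag value."""
    groups = defaultdict(list)
    for tags, value in host_values:
        groups[tags[group_tag]].append(value)
    combine = AGGREGATORS[aggregator]
    return {group: combine(values) for group, values in groups.items()}

# Hypothetical latency readings (in milliseconds) from four hosts in two regions.
latency_ms = [
    ({"host": "a", "region": "us-east-1"}, 120),
    ({"host": "b", "region": "us-east-1"}, 80),
    ({"host": "c", "region": "us-west-2"}, 200),
    ({"host": "d", "region": "us-west-2"}, 100),
]
max_by_region = space_aggregate(latency_ms, "region", "max")
avg_by_region = space_aggregate(latency_ms, "region", "avg")
```

Grouping by region reduces four host-level values to one value per region, and the aggregator decides whether that value is the worst case (max) or the typical case (avg).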
Metric types and real-time metrics visibility
Datadog supports several different metric types that serve distinct use cases: count, gauge, rate, histogram, and distribution. Metric types determine which graphs and functions are available to use with the metric in the app.
The Datadog Agent doesn’t make a separate request to Datadog’s servers for every single data point you send. Instead, it reports values collected over a flush time interval. The metric’s type determines how the values collected from your host over this interval are aggregated for submission.
A count type adds up all the submitted values in a time interval. This would be suitable for a metric tracking the number of website hits, for instance.
The rate type takes the count and divides it by the length of the time interval. This is useful if you’re interested in the number of hits per second.
A gauge type takes the last value reported during the interval. This type would make sense for tracking RAM or CPU usage, where taking the last value provides a representative picture of the host’s behavior during the time interval. In this case, using a different type such as count would probably lead to inaccurate and extreme values. Choosing the correct metric type ensures accurate data.
A histogram reports five different values summarizing the submitted values: the average, count, median, 95th percentile, and max. This produces five different timeseries. This metric type is suitable for things like latency, for which it’s not enough to know the average value. Histograms allow you to understand how your data was spread out without recording every single data point.
A distribution is similar to a histogram, but it summarizes values submitted during a time interval across all hosts in your environment. You can also choose to report multiple percentiles: p50, p75, p90, p95, and p99. You can learn more about this powerful feature in the Distributions documentation.
See the metrics types documentation for more detailed examples of each metric type and submission instructions.
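To make the difference between types concrete, the following Python sketch applies the count, rate, gauge, and histogram rules described above to the same set of values collected during one flush interval. It is a simplified model: the 10-second interval, the nearest-rank percentile method, and the sample values are assumptions. It does not reproduce the Agent's exact calculations.

```python
import math
from statistics import mean, median

def flush(values, metric_type, interval=10):
    """Summarize the values collected during one flush interval (in seconds)."""
    values = list(values)
    if metric_type == "count":
        return sum(values)             # add up all submitted values
    if metric_type == "rate":
        return sum(values) / interval  # count divided by the interval length
    if metric_type == "gauge":
        return values[-1]              # last value reported in the interval
    if metric_type == "histogram":
        ordered = sorted(values)
        # Nearest-rank 95th percentile; the percentile method is an assumption.
        p95_index = math.ceil(0.95 * len(ordered)) - 1
        return {
            "avg": mean(ordered),
            "count": len(ordered),
            "median": median(ordered),
            "95percentile": ordered[p95_index],
            "max": ordered[-1],
        }
    raise ValueError(f"unsupported metric type: {metric_type}")

# Hypothetical values submitted during a single interval.
samples = [1, 1, 1, 2, 2, 2, 3, 3]
as_count = flush(samples, "count")
as_rate = flush(samples, "rate")
as_gauge = flush(samples, "gauge")
as_histogram = flush(samples, "histogram")
```

The same eight submissions yield a single total, a per-second rate, the last value, or five summary values, depending only on the declared type.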
View real-time information about metrics
The Metrics Summary page displays a list of your metrics reported to Datadog under a specified time frame: the past hour, day, or week. Metrics can be filtered by metric name or tag.
Click on any metric name to display a details sidepanel with more detailed information. The details sidepanel displays key information for a given metric, including its metadata (type, unit, interval), number of distinct metrics, number of reporting hosts, number of tags submitted, and a table containing all tags submitted on a metric. Seeing which tags are being submitted on a metric helps you understand the number of distinct metrics reporting from it, since this number depends on your tag value combinations.
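The link between tag value combinations and the number of distinct metrics can be sketched as follows. This is an illustrative model only, with a hypothetical metric name and tags. As with the sidepanel figure itself, it is not a billing calculation.

```python
def count_distinct_timeseries(submissions):
    """Count unique (metric name, tag set) combinations; tag order does not matter."""
    return len({(name, frozenset(tags)) for name, tags in submissions})

# Hypothetical submissions of one metric with different tag value combinations.
submissions = [
    ("shop.cart.size", ["env:prod", "region:us-east-1"]),
    ("shop.cart.size", ["region:us-east-1", "env:prod"]),  # same tags, different order
    ("shop.cart.size", ["env:prod", "region:us-west-2"]),
    ("shop.cart.size", ["env:staging", "region:us-east-1"]),
]
distinct = count_distinct_timeseries(submissions)
```

Every new combination of tag values adds to the total, so a tag with many possible values can multiply the count quickly.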
Note: The number of distinct metrics reported in the details sidepanel on Metrics Summary does not define your bill. See your usage details for a precise accounting of your usage over the past month.
Read the metrics summary documentation for more details.