Metrics are numerical values that can track anything about your environment over time, from latency to error rates to user signups.
Metrics provide an overall picture of your system. You can use them to assess the health of your environment at a glance—how quickly users are loading your website or the average memory consumption of your servers, for instance. Once you identify a problem, you can use logs and tracing to locate the exact source of the issue.
Metrics that track system health come automatically through our integrations with 400+ services. However, you can also track metrics that are specific to your business—also known as custom metrics—from the number of user logins to user cart sizes to the frequency of your team’s code commits.
In addition, metrics can help you adjust the scale of your environment in order to meet the demand from your customers. Knowing exactly how much you need to consume in resources can help you save money or improve performance.
Here’s an example of how a timeseries visualization would look:
This line graph plots latency (in milliseconds) experienced by users on the y-axis against time on the x-axis.
Datadog has many other types of graphs and widgets for visualizations. You can learn more about them here.
Metrics can be sent to Datadog from several places.
Datadog-Supported Integrations: Metrics sent from one of our 400+ integrations are included with your infrastructure plan. To set up metrics from external integrations, navigate to the specific page for your service and follow the installation instructions there. If you need to monitor an EC2 instance, for example, you would go to this page.
Many of our products generate metrics directly. For instance, you can count error status codes appearing in your logs and store that as a new metric in Datadog.
Often, you’ll need to track metrics related to your business (e.g. number of user logins/signups). In these cases, you can create custom metrics. Custom metrics can be submitted through the Agent, DogStatsD, or the HTTP API.
Additionally, the Datadog Agent automatically sends several standard metrics (such as CPU and disk usage).
For a summary of all metric submission sources and methods, please refer to our Metrics Types documentation.
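As a rough illustration of custom metric submission through DogStatsD, the sketch below formats a datagram in the DogStatsD text format and sends it over UDP. The metric name `shop.user.logins`, the helper names, and the local address are assumptions made for this example; in practice you would more likely use an official client library than hand-build datagrams.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; 8125 is the default DogStatsD port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical business metric: one user login, tagged with account:prod.
# "c" marks a count; "g" would mark a gauge and "h" a histogram.
login = format_dogstatsd("shop.user.logins", 1, "c", tags=["account:prod"])
print(login)  # shop.user.logins:1|c|#account:prod
```

Calling `send_metric(login)` would hand the datagram to an Agent listening locally, assuming DogStatsD is enabled on the default port.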
Whether you are using metrics, monitors, dashboards, notebooks, etc., all graphs in Datadog have the same basic functionality. You can create graphs either by using the graphing editor UI or by directly changing the raw query string. To edit the query string, hit the </> button on the far right.
A metric query in Datadog looks like this:
We can break this query into a few steps:
First, choose the specific metric that you’d like to graph by searching or selecting it from the dropdown next to Metric. If you’re not sure which metric to use, start with the Metrics Explorer or a notebook. You can also see a list of metrics on the Metrics Summary page.
After selecting a metric, you can filter your query based on tag(s). For instance, you can use account:prod to scope your query to include only the metrics from your production hosts. For more information, please refer to our Tagging documentation.
Next, choose the granularity of your data using time rollup. In this example, we’ve defined that there will be one data point for every six minutes (360 seconds). You can also choose how you want to aggregate the data in each time bucket. By default, avg is applied, but other available options are sum, min, max, and count. If you wanted to apply max, you would use
In Datadog, “space” refers to the way metrics are distributed over different hosts and tags. There are two different aspects of space that you can control: grouping and aggregation.
Grouping defines what constitutes a line on the graph. For example, if you have hundreds of hosts spread across four regions, grouping by region allows you to graph one line for every region, reducing the number of timeseries to four.
Aggregation defines how the metrics in each group are combined. There are four aggregations available: sum, min, max, and avg.
You can modify your graph values with mathematical functions. This can mean performing arithmetic between an integer and a metric (e.g. multiply a metric by 2), or between two metrics (e.g. create a new timeseries for the memory utilization rate like this: jvm.heap_memory / jvm.heap_memory_max).
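The steps above can be sketched as a small helper that assembles a raw query string from its parts. The `build_query` function is hypothetical, and the metric name and tags are placeholders; the sketch only shows how metric, filter, space aggregation, grouping, and rollup fit together in one string.

```python
def build_query(metric, scope="*", space_agg="avg", group_by=None, rollup=None):
    """Assemble a raw query string from the steps described above.

    rollup is an optional (method, seconds) pair, e.g. ("max", 360).
    """
    # Space aggregator, metric name, and tag filter (scope).
    query = f"{space_agg}:{metric}{{{scope}}}"
    # Optional grouping: one line on the graph per tag value.
    if group_by:
        query += " by {" + ",".join(group_by) + "}"
    # Optional time rollup: aggregation method and bucket size in seconds.
    if rollup:
        method, seconds = rollup
        query += f".rollup({method}, {seconds})"
    return query


# Placeholder metric scoped to account:prod, one line per region,
# keeping the maximum value in each six-minute bucket.
cpu_query = build_query(
    "system.cpu.user",
    scope="account:prod",
    group_by=["region"],
    rollup=("max", 360),
)
print(cpu_query)  # avg:system.cpu.user{account:prod} by {region}.rollup(max, 360)

# Arithmetic between two metrics is written between two queries.
heap_ratio = f"{build_query('jvm.heap_memory')} / {build_query('jvm.heap_memory_max')}"
print(heap_ratio)  # avg:jvm.heap_memory{*} / avg:jvm.heap_memory_max{*}
```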
Time aggregation and space aggregation are two important components of any query. Because understanding how these aggregations work will help you avoid misinterpreting your graphs, these concepts are explained in more detail below.
Datadog stores a large volume of points, and in most cases it’s not possible to display them all on a graph, because there would be more data points than pixels. Time aggregation solves this problem by combining data points into time buckets. This is called a rollup. As the time interval you’ve defined for your query increases, the granularity of your data becomes coarser.
There are five aggregations you can apply to combine your data in each time bucket: sum, min, max, avg, and count.
It’s important to remember that time aggregation is always applied in every query you make because we can’t display every point we store.
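To make the idea of a rollup more tangible, here is a minimal sketch that groups raw points into fixed-width time buckets and applies one of the five aggregations to each bucket. The `rollup` function and the sample data are invented for illustration and are not how Datadog implements this internally.

```python
from collections import defaultdict
from statistics import mean

AGGREGATORS = {"sum": sum, "min": min, "max": max, "avg": mean, "count": len}


def rollup(points, interval, method="avg"):
    """Combine (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Every point falls into the bucket that starts at a multiple of interval.
        bucket_start = timestamp - (timestamp % interval)
        buckets[bucket_start].append(value)
    aggregate = AGGREGATORS[method]
    return [(start, aggregate(values)) for start, values in sorted(buckets.items())]


# Hypothetical latency samples: (seconds since start, milliseconds).
points = [(0, 120), (100, 80), (350, 100), (360, 200), (700, 300)]

# Five raw points collapse into two 360-second buckets.
avg_rollup = rollup(points, 360, "avg")
max_rollup = rollup(points, 360, "max")
print(avg_rollup)
print(max_rollup)
```

With a 360-second interval, the first three points share one bucket and the last two share another, so the graph shows two points instead of five. Choosing `max` instead of `avg` changes the value plotted for each bucket, not the number of buckets.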
Space aggregation splits a single metric into multiple time series by tags such as host, container, region, etc. For instance, if you were interested in viewing the latency of your EC2 instances by region, you would need to use space aggregation to combine each region’s hosts.
There are four aggregations that can be applied when using space aggregation: sum, min, max, and avg. Using the above example, let’s say that your hosts are spread across four regions: us-east-1, us-east-2, us-west-1, and us-west-2. The hosts in each region need to be combined using an aggregator function. Using the max aggregator would result in the maximum latency experienced across hosts in each region, while the avg aggregator would yield the average latency per region.
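The region example can be sketched in a few lines: group each host’s value by its region tag, then combine every group with the chosen aggregator. The host names and latency values below are made up, and `aggregate_by_tag` is a hypothetical helper rather than a Datadog API.

```python
from collections import defaultdict
from statistics import mean

AGGREGATORS = {"sum": sum, "min": min, "max": max, "avg": mean}


def aggregate_by_tag(series, tag, method="avg"):
    """Group per-host values by one tag, then combine each group."""
    groups = defaultdict(list)
    for tags, value in series:
        groups[tags[tag]].append(value)
    aggregate = AGGREGATORS[method]
    return {group: aggregate(values) for group, values in groups.items()}


# Hypothetical latency (ms) reported by four hosts at one point in time.
latency = [
    ({"host": "a", "region": "us-east-1"}, 100),
    ({"host": "b", "region": "us-east-1"}, 300),
    ({"host": "c", "region": "us-west-2"}, 50),
    ({"host": "d", "region": "us-west-2"}, 70),
]

max_by_region = aggregate_by_tag(latency, "region", "max")
avg_by_region = aggregate_by_tag(latency, "region", "avg")
print(max_by_region)
print(avg_by_region)
```

Here four host-level values become one value per region: `max` reports the slowest host in each region, while `avg` reports the regional average.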
Datadog supports several different metric types that serve distinct use cases: count, gauge, rate, histogram, and distribution.
The Datadog Agent doesn’t make a separate request to our servers for every single data point you send. Instead, it reports values collected over a flush time interval. The metric’s type determines how the values collected from your host over this interval are aggregated for submission.
A count type will add up all the submitted values in a time interval; this would be suitable for a metric tracking the number of website hits, for instance.
If you’re more interested in the number of hits per second, the rate type takes the count and divides it by the length of the time interval.
A gauge type will take the last value reported during the interval. This type would make sense for tracking RAM or CPU usage, where taking the last value provides a representative picture of the host’s behavior during the time interval. In this case, using a different type such as count would probably lead to inaccurate and extreme values, which highlights the importance of choosing the correct metric type.
A histogram will report five different values summarizing the submitted values: the average, count, median, 95th percentile, and max. This produces five different timeseries. This metric type is suitable for things like latency, for which it’s not enough to know the average value. Histograms allow you to understand how your data was spread out without recording every single data point.
A distribution is similar to a histogram, but it summarizes values submitted during a time interval across all hosts in your environment. You can also choose to report multiple percentiles: p50, p75, p90, p95, and p99. You can learn more about this powerful feature here.
As a concrete example, suppose your host reported metric values of [1,1,1,2,2,2,3,3] during a ten-second interval. Depending on the metric type you chose, Datadog would store completely different values:
Count would add them up and send the value 15 over to our servers, while rate would take the total sum and divide it by 10 seconds to report a value of 1.5. Gauge would simply report the last value, 3. If your metric is a histogram, Datadog would receive five different values: avg = 1.88, count = 8, median = 2, p95 = 3, max = 3.
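The arithmetic in this example can be reproduced with a short sketch. The `summarize` function is hypothetical, and the 95th percentile uses a simple nearest-rank method, which is an assumption; the Agent’s exact percentile calculation may differ.

```python
import math
from statistics import mean, median


def summarize(values, interval_seconds):
    """Show what each metric type would report for one flush interval."""
    ordered = sorted(values)
    # Nearest-rank percentile (an assumption made for this sketch).
    p95_index = math.ceil(0.95 * len(ordered)) - 1
    total = sum(values)
    return {
        "count": total,                      # sum of submitted values
        "rate": total / interval_seconds,    # count divided by interval length
        "gauge": values[-1],                 # last value submitted
        "histogram": {
            "avg": round(mean(values), 2),
            "count": len(values),            # number of values submitted
            "median": median(values),
            "p95": ordered[p95_index],
            "max": ordered[-1],
        },
    }


report = summarize([1, 1, 1, 2, 2, 2, 3, 3], interval_seconds=10)
print(report)
```

Note that the histogram’s own `count` is the number of values submitted (8), which differs from the count metric type’s sum of values (15).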
Metric types also determine which graphs and functions are available to use with the metric in the app. Please see the metrics types documentation for more detailed examples of each metric type and submission instructions.
The Metrics Summary page displays a list of your metrics reported to Datadog under a specified time frame: the past hour, day, or week. Metrics can be filtered by metric name or tag.
Click on any metric name to display a sidepanel with more detailed information. The metric panel displays key information for a given metric, including its metadata (type, unit, interval), number of distinct metrics, number of reporting hosts, number of tags submitted, and a table containing all tags submitted on a metric. Seeing which tags are being submitted on a metric helps you understand the number of distinct metrics reporting from it, since this number depends on your tag value combinations.
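To see why the number of distinct metrics depends on tag value combinations, consider the following sketch, which counts unique pairs of metric name and tag set. It is only an illustration of the idea, with invented metric and tag names, and should not be read as Datadog’s actual accounting method.

```python
def count_distinct_series(submissions):
    """Count unique (metric name, tag set) pairs among submitted points."""
    return len({(name, frozenset(tags)) for name, tags in submissions})


# Hypothetical submissions of one metric from two hosts and two endpoints.
submissions = [
    ("shop.request.latency", ["host:a", "endpoint:/cart"]),
    ("shop.request.latency", ["host:a", "endpoint:/login"]),
    ("shop.request.latency", ["host:b", "endpoint:/cart"]),
    # Same tags as the first entry in a different order: not a new combination.
    ("shop.request.latency", ["endpoint:/cart", "host:a"]),
]

distinct = count_distinct_series(submissions)
print(distinct)  # 3
```

A single metric name yields three distinct combinations here, and every new tag value that appears alongside the others can add more.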
Note: The number of distinct metrics reported in the details sidepanel on Metrics Summary does not define your bill. Please see your usage details for a precise accounting of your usage over the past month.
Please see the full Metrics Summary documentation for more details.
To continue with metrics, check out: