Network Performance Monitor

Overview

Datadog Network Performance Monitoring (NPM) provides visibility into your network traffic between services, containers, availability zones, and any other tag in Datadog. After you enable NPM, you can create an NPM monitor and get alerted if a TCP network metric crosses a threshold that you have set. For example, you can monitor network throughput between a specific client/server and get alerted if that throughput crosses a threshold.

Monitor creation

To create an NPM monitor in Datadog, use the main navigation: Monitors > New Monitor > Network Performance.

Define the search query

Example configuration with auto-grouped client and server traffic, hidden N/A values, measured as the sum of DNS failures metric with a limit of 100
  1. Construct a search query using the same logic as the NPM analytics search bar.
  2. Select the tags you want to group your client and server by.
  3. Choose if you want to show or hide N/A traffic.
  4. Select a metric you want to measure from the dropdown list. By default, the monitor measures the sum of the metric selected. See which metrics are available for NPM monitors in the metric definitions.
  5. Set the limit on how many results you want to be included in the query.
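
The grouping, N/A, metric, and limit choices above can be pictured with a short Python sketch. It is illustrative only and is not how Datadog evaluates the query: the flow records, the tag names (`client_service`, `server_service`), and the `top_groups` helper are hypothetical, and the search filter from step 1 is not modeled.

```python
from collections import defaultdict

NA = "N/A"  # placeholder for traffic that is missing a grouping tag


def top_groups(flows, client_tag, server_tag, metric, show_na=False, limit=100):
    """Sum `metric` per (client, server) group and return the largest groups.

    `flows` is a hypothetical list of dicts, one per observed flow.
    """
    totals = defaultdict(float)
    for flow in flows:
        client = flow.get(client_tag, NA)
        server = flow.get(server_tag, NA)
        # Step 3: optionally hide traffic that cannot be attributed to a tag.
        if not show_na and NA in (client, server):
            continue
        # Step 4: the default aggregation is the sum of the selected metric.
        totals[(client, server)] += flow.get(metric, 0)
    # Step 5: keep only the `limit` largest groups.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    sample_flows = [
        {"client_service": "web", "server_service": "db", "dns_failures": 3},
        {"client_service": "web", "server_service": "db", "dns_failures": 2},
        {"client_service": "web", "dns_failures": 9},  # no server tag: N/A
        {"client_service": "api", "server_service": "db", "dns_failures": 1},
    ]
    for group, total in top_groups(
        sample_flows, "client_service", "server_service", "dns_failures"
    ):
        print(group, total)
```

With N/A traffic hidden, the untagged flow is dropped before ranking, so it cannot crowd out attributed groups when the limit is small.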

Using formulas and functions

You can create NPM monitors using formulas and functions. This can be used, for example, to create monitors on throughput between a client and server.

The following example uses a formula to calculate the percentage of retransmits from a client to a server.

Example NPM monitor configuration showing percent of retransmits from a client to server

For more information, see the Functions documentation.

Metric definitions

The following tables list the different NPM metrics you can create monitors on.

Volume

Metric name | Definition
Bytes Received | Bytes received from client.
Bytes Sent | Bytes sent from client.
Packets Sent | Packets sent from client.

TCP

Metric name | Definition
Retransmits | Retransmits between client/server.
Latency | Average time it takes to make the connection.
RTT (Round-Trip Time) | Average time it takes to receive a response.
Jitter | Average variance in RTT.
TCP Timeouts | The number of TCP connections that timed out from the perspective of the operating system. This can indicate general connectivity and latency issues.
TCP Refusals | The number of TCP connections that were refused by the server. Typically this indicates an attempt to connect to an IP/port that isn’t receiving connections, or a firewall/security misconfiguration.
TCP Resets | The number of TCP connections that were reset by the server.
Established Connections | Established connections between client/server.
Closed Connections | Closed connections between client/server.

DNS

Metric name | Definition
DNS Requests | Total number of DNS requests.
DNS Failures | Total number of DNS failures.
DNS Timeouts | Total number of DNS timeouts.
DNS Failed Responses | Total number of DNS failed responses.
DNS Successful Responses | Total number of DNS successful responses.
DNS Failure Latency | Average DNS failure latency.
DNS Success Latency | Average DNS success latency.
NXDOMAIN Errors | Total number of NXDOMAIN errors.
SERVFAIL Errors | Total number of SERVFAIL errors.
Other Errors | Total number of other errors.

Set alert conditions

Configure monitors to trigger if the query value crosses a threshold, and customize advanced alert options for recovery thresholds and evaluation delays. For more information, see Configure Monitors.
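
To illustrate how an alert threshold and a recovery threshold interact, here is a minimal Python sketch. It assumes an "above" condition where the monitor alerts when the value exceeds the alert threshold and recovers only once the value is at or below the recovery threshold. The `next_state` function and the state names are hypothetical, evaluation delays are not modeled, and the exact comparison rules in Datadog may differ.

```python
def next_state(current_state, value, alert_threshold, recovery_threshold):
    """Return "ALERT" or "OK" for an 'above' condition with a recovery threshold.

    Assumption: recovery_threshold <= alert_threshold.
    """
    if value > alert_threshold:
        return "ALERT"
    # While alerting, stay in ALERT until the value drops to the recovery
    # threshold; this avoids flapping when the value hovers near the limit.
    if current_state == "ALERT" and value > recovery_threshold:
        return "ALERT"
    return "OK"


if __name__ == "__main__":
    state = "OK"
    for value in [90, 120, 95, 85, 75, 90]:
        state = next_state(state, value, alert_threshold=100, recovery_threshold=80)
        print(value, state)
```

The gap between the two thresholds means a value of 90 is treated differently depending on whether the monitor is already alerting.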

Notifications

For detailed instructions on the Configure notifications and automations section, see the Notifications page.

Common monitors

You can start with the following common NPM monitors. They provide a good starting point for tracking your network and alert you when it shows unusual traffic or unexpected behavior.

Throughput monitor

The throughput monitor alerts you if throughput between two endpoints specified in the query surpasses a threshold. Monitoring throughput can help you determine if your network is nearing capacity given your network bandwidth. Knowing this can give you enough time to make adjustments to your network to avoid bottlenecks and other effects downstream.

Example configuration for a throughput monitor: set Query A to measure Bytes Sent and add the formula throughput(a)
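
As a rough illustration of the arithmetic behind a throughput alert, the Python sketch below assumes that throughput means bytes sent divided by the length of the evaluation interval in seconds. This is an assumption about what `throughput(a)` computes, not a statement of Datadog's implementation, and the function names and sample numbers are hypothetical.

```python
def throughput_bytes_per_second(bytes_sent, interval_seconds):
    """Convert a byte count over an interval into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return bytes_sent / interval_seconds


def exceeds_threshold(bytes_sent, interval_seconds, threshold_bytes_per_second):
    """True if the computed throughput is above the alert threshold."""
    rate = throughput_bytes_per_second(bytes_sent, interval_seconds)
    return rate > threshold_bytes_per_second


if __name__ == "__main__":
    # Hypothetical sample: 6,000,000 bytes sent over a 60-second window.
    print(throughput_bytes_per_second(6_000_000, 60))
    print(exceeds_threshold(6_000_000, 60, threshold_bytes_per_second=50_000))
```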

Percent retransmits

Retransmission occurs when packets are damaged or lost, and it indicates an unreliable network. The percent retransmits monitor alerts you if the percentage of sent packets that result in retransmits passes a threshold.

Example configuration for a percent retransmits monitor: set Query A to measure Retransmits, Query B to measure Packets Sent, and add a formula to calculate the percentage with (a/b)*100
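
The formula (a/b)*100 can be checked by hand with a small Python sketch. The function name, the sample counts, and the choice to return `None` when no packets were sent are all assumptions made for illustration; how the monitor itself treats a zero denominator is not specified here.

```python
def percent_retransmits(retransmits, packets_sent):
    """Compute (a / b) * 100, with a = retransmits and b = packets sent.

    Returns None when no packets were sent, because the ratio is undefined.
    """
    if packets_sent == 0:
        return None
    return (retransmits / packets_sent) * 100


if __name__ == "__main__":
    # Hypothetical sample: 30 retransmits out of 1,200 packets sent.
    value = percent_retransmits(30, 1200)
    threshold = 2.0  # hypothetical alert threshold, in percent
    print(f"{value:.2f}% retransmits, alert={value > threshold}")
```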

DNS failures

The DNS failure monitor tracks DNS server performance to help you identify server-side and client-side DNS issues. Use this monitor to alert you if the sum of DNS failures passes a threshold.

Example configuration for a DNS failure monitor: set Query A to measure DNS Failures
