Network Health is in Preview. Contact your Datadog representative to sign up.

Overview

Network Health provides a unified view of your network’s most critical issues, automatically detecting and prioritizing problems across DNS, TLS certificates, security groups, and network anomalies. It surfaces actionable insights with clear remediation paths, helping you resolve connectivity issues and reduce incident impact.

This page describes the sections of the Network Health page and the issues and insights surfaced in each.

The Network Health page with the collapsible menu open, highlighting Recommended Actions.

Recommended actions

The Recommended Actions section highlights the most critical issues detected in your network. These are prioritized based on:

  1. Severity: Whether the issue is actively blocking traffic
  2. Impact: How critical the affected services are to your infrastructure
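The two criteria above can be read as a sort order: blocking issues first, then the most critical services. The following is a minimal sketch of that ordering, not Datadog's actual ranking logic. The `NetworkIssue` type, its fields, and the 1 to 5 criticality score are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkIssue:
    """Hypothetical record for a detected issue (not a Datadog type)."""
    description: str
    blocking_traffic: bool    # severity: is traffic actively blocked?
    service_criticality: int  # impact: assumed score from 1 (low) to 5 (high)


def prioritize(issues):
    """Order issues so blocking ones come first, then by service criticality."""
    return sorted(
        issues,
        key=lambda issue: (issue.blocking_traffic, issue.service_criticality),
        reverse=True,
    )


issues = [
    NetworkIssue("TLS certificate expires in 5 days", False, 5),
    NetworkIssue("Security group blocks checkout to payments", True, 3),
    NetworkIssue("TLS certificate expired 2 days ago", True, 5),
]

for issue in prioritize(issues):
    print(issue.description)
```

Sorting on a `(severity, impact)` tuple keeps severity as the primary criterion, so a blocking issue on a less critical service still ranks above a non-blocking issue on a critical one.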

Each recommended action displays:

  • The specific problem detected (for example, “TLS certificate expired N days ago”)
  • The impacted client service (the service making requests)
  • The impacted server service (the service receiving requests)

Hover over a service name to pivot to APM, or click Remediate to view remediation steps along with the New Workflow, Create a Case, and Declare an Incident options.

Recommended actions side panel of an affected service, showing remediation steps.

Watchdog Insights

The Watchdog Insights section displays anomalous network behavior detected by Watchdog, focusing on spikes in TCP retransmits. An increase in retransmits compared to your baseline (typically the previous week) often indicates an underlying network issue. See the Watchdog Insights documentation for more information.

Use Watchdog Insights to:

  • Detect potential problems early
  • Correlate anomalies with specific root causes
  • Investigate performance degradation before it impacts users
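To illustrate the baseline comparison described in this section, here is a minimal sketch that flags a retransmit spike when the current average is well above the baseline average. Watchdog's real detection method is not described on this page. The function name and the `threshold_ratio` default are assumptions chosen for the example.

```python
from statistics import mean


def retransmit_spike(current, baseline, threshold_ratio=2.0):
    """Return True when the current average retransmit count is at least
    threshold_ratio times the baseline average.

    current and baseline are sequences of per-interval TCP retransmit counts.
    The ratio-based rule is an illustrative assumption, not Watchdog's method.
    """
    if not current or not baseline:
        return False  # not enough data to compare
    baseline_avg = mean(baseline)
    current_avg = mean(current)
    if baseline_avg == 0:
        # Any retransmits at all are a change from a zero baseline.
        return current_avg > 0
    return current_avg / baseline_avg >= threshold_ratio


last_week = [10, 12, 8]   # baseline: previous week's counts
today = [40, 50, 60]      # current counts
print(retransmit_spike(today, last_week))
```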

TLS certificates

Expired or expiring TLS certificates can block secure connections between services, resulting in dropped traffic. The TLS Certificates section lists:

  • Expired certificates: Certificates that are invalid and blocking traffic
  • Expiring certificates: Certificates about to expire
  • Impacted services: The client and server services affected by each certificate issue (note that the client “service” may be an AWS load balancer, such as an Application Load Balancer)

Click an expired certificate to view steps for renewing it in AWS, or to use the New Workflow, Create a Case, or Declare an Incident options.
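The expired and expiring categories can be sketched as a comparison between a certificate's expiry time and the current time. This example is illustrative only: the 30-day `warning_window` is an assumption, since this page does not state the window Network Health uses. The `parse_not_after` helper converts the `notAfter` string format returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_not_after(not_after_text):
    """Convert a certificate 'notAfter' string (for example,
    'Jan  5 09:34:43 2018 GMT') into a timezone-aware UTC datetime."""
    seconds = ssl.cert_time_to_seconds(not_after_text)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def classify_certificate(not_after, now=None, warning_window=timedelta(days=30)):
    """Classify a certificate as 'expired', 'expiring', or 'valid'.

    warning_window is a hypothetical threshold chosen for this sketch.
    """
    now = now or datetime.now(timezone.utc)
    if not_after <= now:
        return "expired"
    if not_after - now <= warning_window:
        return "expiring"
    return "valid"


expiry = parse_not_after("Jan  5 09:34:43 2018 GMT")
print(classify_certificate(expiry))
```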

DNS failures

DNS misconfigurations can route traffic to incorrect destinations, preventing services from communicating. These failures typically result from changes made to DNS routing configurations.

The DNS Failures section shows:

  • Failure reason: The cause of the DNS failure
  • Impacted DNS server: The DNS server experiencing elevated failure rates
  • Impacted services: The client and server services affected by the DNS failure

Failure reasons:

  • NXDOMAIN: The domain name does not exist, usually due to a misconfiguration or removed domain.
  • TIMEOUT: The DNS query timed out before receiving a response, which may indicate network issues or unresponsive DNS servers.
  • SERVFAIL: The DNS server failed to process a query, often due to a server-side problem.

Hover over a service name to pivot to APM, or click a recommended action to view remediation steps along with the New Workflow, Create a Case, and Declare an Incident options.
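For context on where the failure reasons above come from: NXDOMAIN and SERVFAIL are response codes carried in the low four bits of the DNS header flags (RCODE 3 and 2 in RFC 1035), while a timeout means no response arrived at all. The sketch below classifies a raw DNS response on that basis. The function name and the convention of passing `None` for a timed-out query are assumptions for illustration, and this is not how Network Health collects its data.

```python
import struct

# Response codes defined in RFC 1035 that correspond to the reasons above.
RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def classify_dns_response(response):
    """Classify a raw DNS message.

    response is the raw bytes of a DNS reply, or None if the query
    timed out (a convention assumed for this sketch).
    """
    if response is None:
        return "TIMEOUT"
    if len(response) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    # The header starts with a 16-bit ID followed by 16 bits of flags.
    _query_id, flags = struct.unpack("!HH", response[:4])
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags field
    return RCODE_NAMES.get(rcode, f"RCODE_{rcode}")


# A header-only reply with flags 0x8183: a response with RCODE 3.
nxdomain_reply = struct.pack("!HHHHHH", 0x1234, 0x8183, 1, 0, 0, 0)
print(classify_dns_response(nxdomain_reply))
print(classify_dns_response(None))
```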

Security groups

Security groups control traffic flow in cloud environments through allow and deny rules. Because security groups deny traffic by default, accidental rule deletions or modifications can immediately block legitimate traffic between services.
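The default-deny behavior can be sketched as follows: traffic passes only if some rule explicitly matches it, so removing a rule blocks the traffic it covered. This is a simplified model that assumes allow rules only, matched on protocol, port range, and source CIDR. The `AllowRule` type and `is_allowed` function are hypothetical and do not represent an AWS or Datadog API.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class AllowRule:
    """Hypothetical inbound allow rule for this sketch."""
    protocol: str
    from_port: int
    to_port: int
    cidr: str


def is_allowed(rules, protocol, port, source_ip):
    """Return True only if at least one rule matches the traffic."""
    source = ip_address(source_ip)
    for rule in rules:
        if (
            rule.protocol == protocol
            and rule.from_port <= port <= rule.to_port
            and source in ip_network(rule.cidr)
        ):
            return True
    return False  # default deny: no matching rule means traffic is blocked


rules = [AllowRule("tcp", 443, 443, "10.0.0.0/16")]
print(is_allowed(rules, "tcp", 443, "10.0.1.5"))  # matching rule present
print(is_allowed([], "tcp", 443, "10.0.1.5"))     # same traffic after the rule is deleted
```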

Note: Security group monitoring is available only for AWS and requires EC2 resource collection to be enabled in your AWS integration.

The Security Groups section identifies:

  • Security group misconfigurations blocking traffic
  • The specific services unable to communicate
  • Recent changes to security group rules

Resolution:

  1. Click a security group issue to open the side panel.
  2. Select View in AWS to navigate to the AWS console.
  3. Review and modify the inbound and outbound rules.
  4. Use the Infrastructure Change Tracking data in the side panel to identify when the change occurred and revert it if necessary.

Filtering

Use the filters at the top of the page to narrow the scope of displayed issues. Available filters include:

  • Service: View issues affecting a particular service
  • Team: View issues owned by a specific team
