Network Health is in Preview. Contact your Datadog representative to sign up.
Overview
Network Health provides a unified view of your network’s most critical issues, automatically detecting and prioritizing problems across DNS, TLS certificates, security groups, and network anomalies. It surfaces actionable insights with clear remediation paths, helping you resolve connectivity issues and reduce incident impact.
This page describes the sections of the Network Health page and the issues and insights surfaced in each.
Prerequisites
Recommended Actions
The Recommended Actions section highlights the most critical issues detected in your network. These are prioritized based on:
- Severity: Whether the issue is actively blocking traffic
- Impact: How critical the affected services are to your infrastructure
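As a rough illustration of how these two criteria can combine into an ordering, the sketch below sorts issues so that traffic-blocking issues come first and, within each group, higher-impact services rank higher. This page does not describe the actual ranking logic, so the `Issue` type, the `impact_score` field, and the sort order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    description: str
    blocking: bool      # Severity: is the issue actively blocking traffic?
    impact_score: int   # Impact: hypothetical criticality score, higher is more critical


def prioritize(issues):
    """Order issues: blocking first, then by descending impact (assumed ordering)."""
    return sorted(issues, key=lambda issue: (not issue.blocking, -issue.impact_score))


issues = [
    Issue("TLS certificate expiring soon", blocking=False, impact_score=9),
    Issue("Security group blocking traffic", blocking=True, impact_score=4),
    Issue("TLS certificate expired", blocking=True, impact_score=7),
]
ranked = [issue.description for issue in prioritize(issues)]
```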
Each recommended action displays:
- The specific problem detected (for example, “TLS certificate expired N days ago”)
- The impacted client service (the service making requests)
- The impacted server service (the service receiving requests)
Hover over a service name to pivot to APM, or click Remediate to view remediation steps along with options to create a New Workflow, Create a Case, or Declare an Incident.
Watchdog Insights
The Watchdog Insights section displays anomalous network behavior detected by Watchdog, focusing on spikes in TCP retransmits. An increase in retransmits compared to your baseline (typically the previous week) often indicates an underlying network issue. See the Watchdog Insights documentation for more information.
Use Watchdog Insights to:
- Detect potential problems early
- Correlate anomalies with specific root causes
- Investigate performance degradation before it impacts users
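To make the baseline comparison concrete, the following minimal sketch flags a spike when the average retransmit count in the current window is a multiple of the baseline average. It is not Watchdog's detection algorithm, which this page does not describe; the function name and the `ratio_threshold` value of 2.0 are assumptions chosen for illustration.

```python
from statistics import mean


def retransmit_spike(current, baseline, ratio_threshold=2.0):
    """Return True if current retransmits are well above the baseline.

    `current` and `baseline` are non-empty lists of retransmit counts,
    for example one sample per hour. The threshold is an assumed value.
    """
    baseline_mean = mean(baseline)
    current_mean = mean(current)
    if baseline_mean == 0:
        # Any retransmits at all are a change from a zero baseline.
        return current_mean > 0
    return current_mean / baseline_mean >= ratio_threshold
```

For example, `retransmit_spike([30, 40], [10, 12, 8])` compares a current mean of 35 with a baseline mean of 10 and reports a spike.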
TLS certificates
Expired or expiring TLS certificates can block secure connections between services, resulting in dropped traffic. The TLS Certificates section lists:
- Expired certificates: Certificates that are invalid and blocking traffic
- Expiring certificates: Certificates about to expire
- Impacted services: The client and server services affected by each certificate issue (note that the client “service” may be an AWS load balancer, such as an Application Load Balancer)
Click an expired certificate to view steps for renewing it in AWS, or to create a New Workflow, Create a Case, or Declare an Incident.
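The distinction between expired and expiring certificates can be sketched with Python's standard `ssl` module, which parses the `notAfter` timestamp format returned by `ssl.SSLSocket.getpeercert()`. The 30-day window for "expiring" is an assumption, not a documented Network Health threshold, and `classify_certificate` is a hypothetical helper.

```python
import ssl
import time

EXPIRING_SOON_DAYS = 30  # assumed window, not a documented Network Health value


def classify_certificate(not_after, now=None):
    """Classify a certificate from its notAfter field.

    `not_after` uses the getpeercert() format, for example
    "Jan  5 09:34:43 2018 GMT". `now` is seconds since the epoch and
    defaults to the current time.
    """
    now = time.time() if now is None else now
    expires_at = ssl.cert_time_to_seconds(not_after)
    days_left = (expires_at - now) / 86400
    if days_left < 0:
        return "expired", days_left
    if days_left <= EXPIRING_SOON_DAYS:
        return "expiring", days_left
    return "ok", days_left
```

A negative `days_left` corresponds to the "expired N days ago" wording used in recommended actions.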
DNS failures
DNS misconfigurations can route traffic to incorrect destinations, preventing services from communicating. These failures typically result from changes made to DNS routing configurations.
The DNS Failures section shows:
- Failure reason: The cause of the DNS failure
- Impacted DNS server: The DNS server experiencing elevated failure rates
- Impacted services: The client and server services affected by the DNS failure
Failure reasons:
- NXDOMAIN: The domain name does not exist, usually due to a misconfiguration or removed domain.
- TIMEOUT: The DNS query timed out before receiving a response, which may indicate network issues or unresponsive DNS servers.
- SERVFAIL: The DNS server failed to process a query, often due to a server-side problem.
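For reference, NXDOMAIN and SERVFAIL correspond to response codes (RCODE 3 and RCODE 2) carried in the DNS message header defined in RFC 1035, while TIMEOUT means no response arrived at all. The sketch below shows how a raw response could be mapped to these reasons. The function name is hypothetical, and this is not a description of how Network Health collects DNS data.

```python
from typing import Optional

# RCODE values from the DNS header (RFC 1035, section 4.1.1).
RCODE_NOERROR = 0
RCODE_SERVFAIL = 2
RCODE_NXDOMAIN = 3


def classify_dns_response(response: Optional[bytes]) -> str:
    """Map a raw DNS response, or None if no response arrived, to a failure reason."""
    if response is None:
        return "TIMEOUT"
    if len(response) < 12:
        raise ValueError("DNS header is 12 bytes; response is too short")
    rcode = response[3] & 0x0F  # RCODE is the low 4 bits of the second flags byte
    if rcode == RCODE_NOERROR:
        return "OK"
    if rcode == RCODE_NXDOMAIN:
        return "NXDOMAIN"
    if rcode == RCODE_SERVFAIL:
        return "SERVFAIL"
    return f"OTHER({rcode})"
```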
Hover over a service name to pivot to APM, or click a recommended action to view remediation steps along with options to create a New Workflow, Create a Case, or Declare an Incident.
Security groups
Security groups control traffic flow in cloud environments through rules that define which traffic is allowed (AWS security groups support allow rules only). Because traffic that no rule allows is denied by default, accidental rule deletions or modifications can immediately block legitimate traffic between services.
Note: Security group monitoring is available only for AWS and requires EC2 resource collection to be enabled in your AWS integration.
The Security Groups section identifies:
- Security group misconfigurations blocking traffic
- The specific services unable to communicate
- Recent changes to security group rules
Resolution:
- Click on a security group issue to open the side panel.
- Select View in AWS to navigate to the AWS console.
- Review and modify the inbound and outbound rules.
- Use the Infrastructure Change Tracking data in the side panel to identify when the change occurred and revert it if necessary.
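When reviewing inbound rules, it can help to reason about them as data. The sketch below checks whether a set of rules permits a connection, assuming the rules are shaped like the `IpPermissions` list returned by the EC2 DescribeSecurityGroups API. It is a simplification: it evaluates only IPv4 CIDR ranges (`IpRanges`) and ignores IPv6 ranges, prefix lists, and rules that reference other security groups. The function name is hypothetical.

```python
import ipaddress


def inbound_allowed(ip_permissions, protocol, port, source_ip):
    """Return True if any rule allows `protocol`/`port` from `source_ip`.

    With no matching rule the result is False, mirroring deny-by-default.
    """
    source = ipaddress.ip_address(source_ip)
    for rule in ip_permissions:
        rule_protocol = rule.get("IpProtocol")
        if rule_protocol != "-1":  # "-1" means all protocols and ports
            if rule_protocol != protocol:
                continue
            if not (rule.get("FromPort", 0) <= port <= rule.get("ToPort", 65535)):
                continue
        for ip_range in rule.get("IpRanges", []):
            if source in ipaddress.ip_network(ip_range["CidrIp"]):
                return True
    return False


rules = [
    {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.0.0.0/16"}],
    }
]
```

With these example rules, HTTPS traffic from `10.0.1.5` is allowed, while the same request on port 80 or from `192.168.1.1` is not. Deleting the rule leaves an empty list, so every request is denied.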
Filtering
Use the filters at the top of the page to narrow the scope of displayed issues. Available filters include:
- Service: View issues affecting a particular service
- Team: View issues owned by a specific team