Join the Preview!
GPU Monitoring is in Preview. To join the preview, click Request Access and complete the form.
Overview
Datadog’s GPU Monitoring provides a centralized view into your GPU fleet’s health, cost, and performance. It enables teams to make better provisioning decisions, optimize and troubleshoot AI workload performance, and eliminate idle GPU costs without having to manually set up individual vendor tools (like NVIDIA’s DCGM). GPU Monitoring supports fleets deployed across the major cloud providers (AWS, GCP, Azure, Oracle Cloud), hosted on-premises, or provisioned through GPU-as-a-Service platforms like CoreWeave and Lambda Labs.
You can access insights into your GPU fleet by deploying the Datadog Agent on your GPU-accelerated hosts. For setup instructions, see Set up GPU Monitoring.
Key Capabilities
1. Make data-driven GPU allocation and provisioning decisions
With a comprehensive view of your entire fleet and available capacity, Datadog’s GPU Monitoring helps you assign and manage your infrastructure and capacity fairly across your organization.
You can also see your current device availability and forecast how many devices certain teams or workloads need, so that workloads do not fail because of resource contention.
With GPU Monitoring’s resource telemetry, you can analyze trends in GPU resources and metrics (including GPU utilization, power, and memory) by host, node, or pod over time, helping you understand how your devices affect model and application performance. For example, you can identify hotspots in, or underutilization of, expensive GPU infrastructure that could bottleneck your workloads’ execution.
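To make the idea of hotspots and underutilization concrete, here is a minimal sketch that groups utilization samples by host and labels each host by its average. It does not call any Datadog API: the sample data, the host names, the `classify_hosts` helper, and the 10% and 90% thresholds are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (host, GPU utilization percent). Not real Datadog output.
SAMPLES = [
    ("gpu-host-a", 96.0),
    ("gpu-host-a", 98.0),
    ("gpu-host-b", 4.0),
    ("gpu-host-b", 6.0),
    ("gpu-host-c", 55.0),
    ("gpu-host-c", 65.0),
]


def classify_hosts(samples, low=10.0, high=90.0):
    """Label each host by its mean utilization; thresholds are illustrative."""
    by_host = defaultdict(list)
    for host, util in samples:
        by_host[host].append(util)

    labels = {}
    for host, values in by_host.items():
        avg = mean(values)
        if avg >= high:
            labels[host] = "hotspot"
        elif avg <= low:
            labels[host] = "underutilized"
        else:
            labels[host] = "ok"
    return labels


if __name__ == "__main__":
    print(classify_hosts(SAMPLES))
```

The same grouping could be keyed by node or pod instead of host; only the first element of each sample would change.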
3. Proactively detect hardware issues
GPUs are expensive, scarce resources with higher failure rates than standard servers. Datadog’s GPU Monitoring solution provides out-of-the-box monitors and proactive recommendations to help you detect and remediate hardware issues before they impact your mission-critical workloads.
4. Identify and eliminate wasted, idle GPU costs
Identify total spend on GPU infrastructure and attribute those costs to specific workloads and instances. Directly correlate GPU usage to related pods or processes.
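As a rough illustration of attributing idle spend to workloads, the sketch below multiplies allocated-but-idle GPU hours by an hourly rate. This is one simple way to reason about idle cost, not a description of how Datadog computes it: the `GpuUsage` record, its fields, the workload names, and the rates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuUsage:
    workload: str        # hypothetical pod or process name
    gpu_hours: float     # hours the GPU was allocated to the workload
    active_hours: float  # hours the GPU was actually busy
    hourly_rate: float   # assumed price per GPU hour, in dollars


def idle_cost_by_workload(records):
    """Return {workload: idle dollars}: allocated-but-idle hours times rate."""
    costs = {}
    for r in records:
        # Clamp at zero in case active time is reported above allocated time.
        idle_hours = max(r.gpu_hours - r.active_hours, 0.0)
        costs[r.workload] = costs.get(r.workload, 0.0) + idle_hours * r.hourly_rate
    return costs


if __name__ == "__main__":
    records = [
        GpuUsage("training-job", 10.0, 8.0, 2.0),
        GpuUsage("notebook", 24.0, 3.0, 2.0),
    ]
    print(idle_cost_by_workload(records))
```

Under these assumed numbers, the mostly idle notebook accounts for far more wasted spend than the training job, which is the kind of signal that points to reclaimable capacity.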
Ready to start?
See Set up GPU Monitoring for instructions on how to set up Datadog’s GPU Monitoring.