GPU Monitoring



Join the Preview!

GPU Monitoring is in Preview. To join the preview, click Request Access and complete the form.


Overview

Datadog’s GPU Monitoring provides a centralized view of your GPU fleet’s health, cost, and performance. It enables teams to make better provisioning decisions, optimize and troubleshoot AI workload performance, and eliminate idle GPU costs without having to manually set up individual vendor tools (like NVIDIA’s DCGM). GPU Monitoring supports fleets deployed across the major cloud providers (AWS, GCP, Azure, Oracle Cloud), hosted on-premises, or provisioned through GPU-as-a-Service platforms like CoreWeave and Lambda Labs.

You can access insights into your GPU fleet by deploying the Datadog Agent on your GPU-accelerated hosts. For setup instructions, see Set up GPU Monitoring.
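For Kubernetes fleets, enabling the feature can be reduced to a Helm values flag. The sketch below is an assumption about the chart’s key names, which may change while the product is in Preview; treat Set up GPU Monitoring as the authoritative reference.

```yaml
# values.yaml for the Datadog Helm chart (sketch; key names may differ
# across chart versions -- see "Set up GPU Monitoring" for current steps)
datadog:
  apiKeyExistingSecret: datadog-secret  # secret holding your Datadog API key
  gpuMonitoring:
    enabled: true  # turn on GPU device and workload telemetry collection
```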

Key Capabilities

1. Make data-driven GPU allocation and provisioning decisions

With a comprehensive view of your entire fleet and available capacity, Datadog’s GPU Monitoring helps you assign and manage your infrastructure and capacity fairly across your organization.

Image: funnel visualization titled 'Your GPU fleet at a glance,' showing total, active, and effective devices, and highlighting underutilized GPU cores and idle devices.

You can also understand your current device availability and forecast how many devices specific teams or workloads will need, helping you avoid workload failures caused by resource contention.

Image: charts visualizing GPU allocation: a line graph, 'Device Allocation Over Time,' plotting counts of total, allocated, and active devices with a 4-week forecast; a donut chart, 'Cloud Provider Instance Breakdown,' showing the mix of cloud provider instances across the fleet; and a 'Device Type Breakdown' showing allocated versus total counts for various GPU devices.

2. Maximize model and application performance

With GPU Monitoring’s resource telemetry, you can analyze trends in GPU resources and metrics (including GPU utilization, power, and memory) by host, node, or pod over time, helping you understand devices’ effects on your model and application performance. For example, you can identify hotspots or underutilization of expensive GPU infrastructure that could be bottlenecks for your workloads’ execution.
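For example, you could graph per-device utilization across a training service with a query along these lines; the metric and tag names (`gpu.utilization`, `gpu_uuid`) and the `model-training` service are illustrative assumptions, not necessarily the product’s exact names:

```
avg:gpu.utilization{service:model-training} by {host,gpu_uuid}
```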

Image: detail view of a device, with configurable timeseries visualizations for SM activity, memory utilization, power, and engine activity.

3. Proactively detect hardware issues

GPUs are expensive, scarce resources with higher failure rates than standard servers. Datadog’s GPU Monitoring solution provides OOTB monitors and proactive recommendations to help you detect and remediate hardware issues before they impact your mission-critical workloads.
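As a hedged sketch of what such a monitor could look like, a metric monitor might alert when a device runs hot for several minutes. The metric name `gpu.temperature`, the `gpu_uuid` tag, and the 85 °C threshold below are illustrative assumptions, not the actual out-of-the-box monitor definitions:

```
avg(last_10m):avg:gpu.temperature{*} by {host,gpu_uuid} > 85
```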

4. Identify and eliminate wasted, idle GPU costs

Identify total spend on GPU infrastructure and attribute those costs to specific workloads and instances. Directly correlate GPU usage to related pods or processes.
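The arithmetic behind idle-cost attribution can be sketched in a few lines: weight each workload’s spend by its average GPU utilization, and treat the remainder as idle cost. The workload names, hours, utilization figures, and hourly rate below are hypothetical, standing in for the data GPU Monitoring collects.

```python
# Sketch: splitting GPU spend into used vs. idle cost per workload.
# All numbers are hypothetical; real figures would come from GPU Monitoring.

HOURLY_RATE = 32.77  # example on-demand price for a GPU instance (USD/hr)

# (workload, hours_used, avg_gpu_utilization) -- illustrative samples
samples = [
    ("training-llm", 120, 0.82),
    ("batch-inference", 80, 0.35),
    ("idle-pool", 48, 0.0),
]

def cost_breakdown(samples, rate):
    """Split each workload's spend into total and idle portions,
    weighting the idle share by (1 - average GPU utilization)."""
    rows = []
    for name, hours, util in samples:
        total = hours * rate
        rows.append({
            "workload": name,
            "total_cost": round(total, 2),
            "idle_cost": round(total * (1 - util), 2),
        })
    return rows

for row in cost_breakdown(samples, HOURLY_RATE):
    print(row)
```

A fully idle pool shows up with its entire spend as idle cost, which is exactly the waste this capability is meant to surface.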

Image: detail view of a cluster, with a funnel visualization of devices (total, allocated, active, effective), total cloud cost, idle cloud cost, and details of connected entities (pods, processes, Slurm jobs).

Ready to start?

See Set up GPU Monitoring for installation and configuration instructions.
