GPU Monitoring

Join the Preview!

GPU Monitoring is in Preview. To join the preview, click Request Access and complete the form.

Overview

Datadog’s GPU Monitoring provides a centralized view into GPU fleet health, cost, and performance. GPU Monitoring enables teams to make better provisioning decisions, troubleshoot failed workloads, and eliminate idle GPU costs without having to manually set up individual vendor tools (like NVIDIA’s DCGM). You can access insights into your GPU fleet by deploying the Datadog Agent.

For setup instructions, see Set up GPU Monitoring.

Make data-driven GPU allocation and provisioning decisions

With visibility into GPU utilization by host, node, or pod, you can identify hotspots or underutilization of expensive GPU infrastructure.

Funnel visualization titled 'Your GPU fleet at a glance.' Displays total, allocated, active, and effective devices. Highlights underutilized GPU cores and idle devices.
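As a rough illustration of how a funnel like this narrows from total to effective devices, the sketch below counts devices at each stage. The stage definitions are assumptions made for this example, not Datadog's documented definitions: a device is treated as allocated when a workload is assigned to it, active when its core utilization is above zero, and effective when utilization meets a hypothetical threshold. The `Device` record and `fleet_funnel` function are likewise hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    """Hypothetical per-device record; field names are illustrative only."""
    device_id: str
    workload: Optional[str]   # None means no workload is assigned (assumption)
    core_utilization: float   # fraction of GPU cores in use, 0.0 to 1.0


def fleet_funnel(devices, effective_threshold=0.5):
    """Count devices at each funnel stage under the assumed definitions."""
    allocated = [d for d in devices if d.workload is not None]
    active = [d for d in allocated if d.core_utilization > 0.0]
    # The threshold is an assumption chosen for this sketch.
    effective = [d for d in active if d.core_utilization >= effective_threshold]
    return {
        "total": len(devices),
        "allocated": len(allocated),
        "active": len(active),
        "effective": len(effective),
    }


fleet = [
    Device("gpu-0", "training-job", 0.92),
    Device("gpu-1", "training-job", 0.10),  # active but underutilized
    Device("gpu-2", "inference", 0.0),      # allocated but idle
    Device("gpu-3", None, 0.0),             # unallocated
]
funnel = fleet_funnel(fleet)
print(funnel)
```

In this sample fleet, the gap between allocated and active devices points to idle hardware, and the gap between active and effective devices points to underutilized cores.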

Troubleshoot failed workloads due to resource contention

Understand your current device availability and forecast how many devices specific teams or workloads need, so that workloads do not fail because of resource contention.

Charts to help visualize GPU allocation. A line graph titled 'Device Allocation Over Time', plotting counts of total/allocated/active devices, including a 4-week future forecast. A donut chart titled 'Cloud Provider Instance Breakdown', displaying prevalence of cloud provider instances across the fleet. A 'Device Type Breakdown' displaying allocated/total for various GPU devices.

Identify and eliminate wasted, idle GPU costs

Identify total spend on GPU infrastructure and attribute those costs to specific workloads and instances. Directly correlate GPU usage to related pods or processes.

Detail view of a cluster, displaying funnel visualization of devices (total/allocated/active/effective), total cloud cost, idle cloud cost, and visualizations and details of various connected entities (pods, processors, SLURM jobs).
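The idea of attributing total and idle cost to workloads can be sketched with simple arithmetic. This example is only an illustration under stated assumptions: the record fields (`workload`, `allocated_hours`, `active_hours`, `hourly_rate`) and the `cost_by_workload` function are hypothetical, and idle cost is assumed to be the hourly rate multiplied by the hours a device was allocated but not active. It does not reflect how Datadog computes these figures.

```python
from collections import defaultdict


def cost_by_workload(usage_records):
    """Sum total and idle GPU cost per workload from hypothetical usage records."""
    totals = defaultdict(lambda: {"total_cost": 0.0, "idle_cost": 0.0})
    for rec in usage_records:
        # Total cost: every allocated hour is billed at the device's hourly rate.
        total = rec["allocated_hours"] * rec["hourly_rate"]
        # Idle cost (assumption): allocated hours during which the device was not active.
        idle = (rec["allocated_hours"] - rec["active_hours"]) * rec["hourly_rate"]
        totals[rec["workload"]]["total_cost"] += total
        totals[rec["workload"]]["idle_cost"] += idle
    return dict(totals)


# Illustrative sample data, not real measurements.
records = [
    {"workload": "training", "allocated_hours": 10.0, "active_hours": 8.0, "hourly_rate": 2.0},
    {"workload": "training", "allocated_hours": 4.0, "active_hours": 4.0, "hourly_rate": 2.0},
    {"workload": "inference", "allocated_hours": 24.0, "active_hours": 6.0, "hourly_rate": 1.5},
]
costs = cost_by_workload(records)
for workload, c in costs.items():
    print(workload, c["total_cost"], c["idle_cost"])
```

Grouping by workload rather than by device keeps the result aligned with the question of which team or job is paying for idle capacity.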

Maximize model and application performance

With GPU Monitoring’s resource telemetry, you can analyze trends in GPU resources and metrics (including GPU utilization, power, and memory) over time, helping you understand their effects on your model and application performance.

Detail view of a device, displaying configurable timeseries visualizations for SM activity, memory utilization, power, and engine activity.

Ready to start?

To get started, see Set up GPU Monitoring.