---
title: GPU Monitoring Fleet Page
description: >-
  An inventory of all your GPU-accelerated hosts that helps you diagnose
  performance issues.
---

# GPU Monitoring Fleet Page

## Overview{% #overview %}

The [GPU Fleet page](https://app.datadoghq.com/gpu-monitoring?mConfigure=false&mPage=fleet) provides a detailed inventory of all of your GPU-accelerated hosts for a specified time frame. Use this view to uncover inefficiencies through resource telemetry, ranging from performance and usage metrics to costs. This page also surfaces Datadog's built-in provisioning and performance optimization recommendations for your devices, to help you maximize the value of your GPU spend.

## Break down your fleet by any tag{% #break-down-your-fleet-by-any-tag %}

Use quick filter dropdowns at the top of the page to filter by a specific **Provider**, **Device Type**, **Cluster**, **Region**, **Service**, **Data Center**, **Environment**, or **Team**.

You can also **Search** or **Group** by other tags using the search and group-by fields. For example, with **Host** selected, group by `Team` to view a table entry for each unique team. Click the **`>`** button next to any entry to see the hosts used by that team and the GPU devices accelerating those hosts.

**Note**: You can only **Group by** one additional tag.

If you select *Cluster* or *Host*, you can click on the **`>`** button next to each table entry to view a cluster's hosts or a host's devices, respectively.

{% image
   source="https://docs.dd-static.net/images/gpu_monitoring/host_row_expansion-2.55e8c0d64d463126d23d77279547f459.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/host_row_expansion-2.55e8c0d64d463126d23d77279547f459.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="GPU Fleet table showing services with their device types, with the row expand button highlighted" /%}

**Note**: The Cluster table is only populated if you use Kubernetes.

{% image
   source="https://docs.dd-static.net/images/gpu_monitoring/filters_and_groupings-3.42d57d9a1318bcfc3ef873abe9f06d8b.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/filters_and_groupings-3.42d57d9a1318bcfc3ef873abe9f06d8b.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Filter dropdowns and Group by selector at the top of the GPU Fleet page" /%}

## Use-case driven views{% #use-case-driven-views %}

Datadog guides you through your provisioning and performance optimization workflows by providing two dedicated use-case driven views.

### Provisioning{% #provisioning %}

The Provisioning tab shows key recommendations and metrics insights for allocating and managing your capacity.

{% image
   source="https://docs.dd-static.net/images/gpu_monitoring/provisioning-tab-2.380e66b07dc0b5fef3f7ce665979d1d5.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/provisioning-tab-2.380e66b07dc0b5fef3f7ce665979d1d5.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The Provisioning use-case driven view" /%}

Built-in recommendations:

- Datadog proactively detects thermal throttling and hardware defects based on hardware errors such as ECC and XID errors, and recommends remediation.
- Datadog detects devices that are sitting idle and recommends whether they should be deprovisioned or reassigned, so provisioned capacity doesn't go unused (a simplified illustration follows the metrics list below).

Metrics relevant for your provisioning workflow:

- ECC Errors
- XID Errors
- Graphics Engine Activity
- GPU Utilization
- GPU Memory
- Allocated Devices (Only available for Kubernetes users)
- Active Devices
- Idle Cost
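
To build intuition for how these provisioning metrics fit together, the sketch below classifies devices as idle, active, or effective from per-device utilization samples, loosely following the definitions in the metrics tables later on this page (active: busy at some point in the window; effective: working for more than 50% of the window). The data, threshold handling, and helper function are hypothetical; this is not how Datadog computes the counts.

```python
# Toy illustration (hypothetical data, not Datadog's implementation): classify
# devices as idle, active, or effective from utilization samples, following the
# definitions used on this page (effective = busy for >50% of the time frame).
samples = {
    # device: per-minute utilization fractions (0.0-1.0) over the time frame
    "gpu-0": [0.9, 0.8, 0.95, 0.7],  # busy most of the window
    "gpu-1": [0.0, 0.2, 0.0, 0.1],   # occasionally busy
    "gpu-2": [0.0, 0.0, 0.0, 0.0],   # never busy -> candidate for reallocation
}

def classify(points, effective_threshold=0.5):
    """Return "idle", "active", or "effective" for one device's samples."""
    if not any(p > 0 for p in points):
        return "idle"
    busy_fraction = sum(1 for p in points if p > 0) / len(points)
    return "effective" if busy_fraction > effective_threshold else "active"

for device, points in samples.items():
    print(device, classify(points))
```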

### Performance{% #performance %}

The Performance tab helps you understand workload execution and tune GPU utilization to use your devices more effectively.

{% image
   source="https://docs.dd-static.net/images/gpu_monitoring/performance-tab-2.bcc134b19480fb39bc37aa622ead9c36.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/performance-tab-2.bcc134b19480fb39bc37aa622ead9c36.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="The Performance use-case driven view" /%}

Built-in recommendations:

- If your workloads are CPU-intensive, Datadog flags hosts with CPU saturation and recommends solutions.
- If your workloads aren't effectively using their allocated GPU devices, Datadog provides recommendations for tuning them to get more value out of that capacity (the sketch below illustrates a naive version of these checks).
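
Datadog applies these checks for you automatically. Purely to illustrate the idea, the sketch below runs a naive version of the same heuristic over hypothetical per-host averages of `system.cpu.user` and `gpu.sm_active`; the thresholds and data are made up, not Datadog's.

```python
# Naive illustration (made-up thresholds and data, not Datadog's logic): flag
# hosts that look CPU-bound or that leave their GPUs underutilized.
hosts = {
    # host: (avg system.cpu.user %, avg gpu.sm_active %)
    "trainer-1": (95.0, 20.0),  # CPU saturated while the GPU mostly waits
    "trainer-2": (35.0, 15.0),  # neither busy: workload isn't using its GPUs
    "trainer-3": (60.0, 85.0),  # healthy
}

for host, (cpu_user, sm_active) in hosts.items():
    if cpu_user > 90 and sm_active < 50:
        print(f"{host}: CPU saturation is likely starving the GPU")
    elif sm_active < 30:
        print(f"{host}: GPUs look underutilized; consider tuning the workload")
    else:
        print(f"{host}: ok")
```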

Metrics relevant for your performance workflow:

- ECC Errors
- XID Errors
- Graphics Engine Activity
- GPU Utilization
- GPU Memory
- Effective Devices
- Power
- Temperature
- PCIe RX Throughput
- PCIe TX Throughput
- CPU Utilization

## Summary Graph{% #summary-graph %}

After selecting Cluster, Host, or Device, the **Summary Graph** displays key resource telemetry across your entire GPU infrastructure grouped by that selection. Expand the section below to see a table of the available metrics and what they represent.

{% collapsible-section #gpu-metrics-table %}
#### See the full list of GPU metrics

| Metric                   | Definition                                                                                                                                                                                                                  | Metric Name                                        |
| ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| Provisioned Devices      | Breakdown of provisioned devices by active and effective devices.                                                                                                                                                           | `gpu.device.total`                                 |
| Allocated Devices        | (Only available if using Kubernetes) Count of devices that have been allocated to a workload.                                                                                                                               | `gpu.device.total`                                 |
| Active Devices           | Count of devices that are actively used for a workload or are busy. If using Kubernetes: count of allocated devices that are actively used for a workload.                                                                  | `gpu.gr_engine_active`                             |
| Effective Devices        | Count of devices that are used and working for more than 50% of the selected time frame.                                                                                                                                    | `gpu.sm_active`                                    |
| Core Utilization         | (Only available if System Probe enabled) `Cores Used/Cores Limit` for GPU processes. Measure of Temporal Core Utilization.                                                                                                  | `gpu_core_utilization`                             |
| GPU Memory               | Ratio (%) of GPU memory used to total GPU memory limit.                                                                                                                                                                     | `100 - (gpu.memory.free / gpu.memory.limit * 100)` |
| PCIe RX Throughput       | Bytes received through PCI from the GPU device per second.                                                                                                                                                                  | `gpu.pci.throughput.rx`                            |
| PCIe TX Throughput       | Bytes transmitted through PCI to the GPU device per second.                                                                                                                                                                 | `gpu.pci.throughput.tx`                            |
| Graphics Engine Activity | Fraction of time the GPU was performing any compute work during the interval. A coarse signal of whether the GPU is busy or idle.                                                                                           | `gpu.gr_engine_active`                             |
| GPU Utilization          | Average % of time each streaming multiprocessor was active (lower values indicate idle time).                                                                                                                               | `gpu.sm_active`                                    |
| Power | Power usage for the GPU device. **Note**: On GA100 and previous architectures, this represents the instantaneous power draw at that moment. For newer architectures, it represents the average power draw (Watts) over one second. | `gpu.power.usage` |
| Temperature              | Temperature of a GPU device.                                                                                                                                                                                                | `gpu.temperature`                                  |
| SM Clock                 | SM clock frequency in MHz.                                                                                                                                                                                                  | `gpu.clock_speed.sm`                               |
| Memory Free              | Amount of available / free memory.                                                                                                                                                                                          | `gpu.memory.free`                                  |
| GPU Saturation           | Measures how fully the GPU's parallel execution capacity is being used during the time frame (average ratio of active warps to the maximum warps supported per streaming multiprocessor across all SMs).                    | `gpu.sm_occupancy`                                 |
| NVLink RX                | Total RX of all NVLINK links.                                                                                                                                                                                               | `gpu.nvlink.throughput.raw.rx`                     |
| NVLink TX                | Total TX of all NVLINK links.                                                                                                                                                                                               | `gpu.nvlink.throughput.raw.tx`                     |
| NVLink Active Links      | Number of active NVLINK links for the device.                                                                                                                                                                               | `gpu.nvlink.count.active`                          |
| ECC Errors               | Total count of uncorrected ECC errors.                                                                                                                                                                                      | `gpu.errors.ecc.uncorrected.total`                 |
| XID Errors               | Count of NVIDIA XID errors, indicating hardware or driver-level issues.                                                                                                                                                     | `gpu.errors.xid.total`                             |
| CPU Utilization          | % of time the CPU spent running user space processes.                                                                                                                                                                       | `system.cpu.user`                                  |
| Host Uptime | Time since the host was last started. | `system.uptime` |
| Host I/O Utilization | % of CPU time during which I/O requests were issued to the device. | `system.io.util` |
| Host Memory              | % of usable memory in use.                                                                                                                                                                                                  | `system.mem.pct_usable`                            |

{% /collapsible-section %}

If you've selected an additional tag to group by (for example, *team*), each unique timeseries in the Summary Graph corresponds to one value of that tag for the selected metric.
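
The GPU Memory value above is derived rather than reported directly. As a quick worked example of the formula from the table (`100 - (gpu.memory.free / gpu.memory.limit * 100)`), with made-up numbers:

```python
# Worked example of the derived GPU Memory column, using made-up values:
# GPU Memory % = 100 - (gpu.memory.free / gpu.memory.limit * 100)
memory_free_bytes = 20 * 1024**3   # hypothetical: 20 GiB free on the device
memory_limit_bytes = 80 * 1024**3  # hypothetical: 80 GiB total on the device

gpu_memory_pct = 100 - (memory_free_bytes / memory_limit_bytes * 100)
print(f"GPU Memory: {gpu_memory_pct:.1f}%")  # 75.0%
```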

## Inventory of your GPU-powered infrastructure{% #inventory-of-your-gpu-powered-infrastructure %}

This table breaks down your GPU-powered infrastructure by any tag of your choosing. If you haven't specified an additional tag in the **Group by** field, results are grouped by your selected view: Cluster, Host, or Device.

By default, the table of results displays the following columns:

- Device Name
- Graphics Engine Activity
- GPU Utilization
- Core Utilization (Only if System Probe is enabled)
- GPU Memory
- Idle Cost
- Recommendation

You can click on the gear icon to customize which metrics are displayed within the table. Expand the section below to see a full list of the available metrics.

{% collapsible-section #metric-full-list %}
#### See the full list of available metrics

| Category        | Metric                   | Definition                                                                                                                                                                                                                  | Metric Name                                        |
| --------------- | ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
| —               | Device Name              | Type of GPU device.                                                                                                                                                                                                         | `gpu_device`                                       |
| Hardware Health | Total Errors             | Total count of errors for the resource.                                                                                                                                                                                     | `gpu.errors.total`                                 |
| Hardware Health | ECC Errors               | Total count of uncorrected ECC errors.                                                                                                                                                                                      | `gpu.errors.ecc.uncorrected.total`                 |
| Hardware Health | XID Errors               | Count of NVIDIA XID errors, indicating hardware or driver-level issues.                                                                                                                                                     | `gpu.errors.xid.total`                             |
| Utilization     | Graphics Engine Activity | Fraction of time the GPU was performing any compute work during the interval. A coarse signal of whether the GPU is busy or idle.                                                                                           | `gpu.gr_engine_active`                             |
| Utilization     | GPU Saturation           | Measures how fully the GPU's parallel execution capacity is being used during the time frame (average ratio of active warps to the maximum warps supported per streaming multiprocessor across all SMs).                    | `gpu.sm_occupancy`                                 |
| Utilization     | Core Utilization         | (Only available if System Probe enabled) `Cores Used/Cores Limit` for GPU processes. Measure of Temporal Core Utilization.                                                                                                  | `gpu_core_utilization`                             |
| Utilization     | GPU Idle                 | % of time the GPU device is idle.                                                                                                                                                                                           | `100-gpu.gr_engine_active`                         |
| Provisioning | Idle Cost | (Only nonzero for time frames longer than 2 days) The cost of GPU resources that are reserved and allocated, but not used. | — |
| Provisioning    | Allocated Devices        | (Only available if using Kubernetes) Count of devices that have been allocated to a workload.                                                                                                                               | `gpu.device.total`                                 |
| Provisioning | Unallocated Devices | Count of devices not allocated and available for use during the time frame. | — |
| Provisioning    | Active Devices           | Count of devices that are actively used for a workload or are busy. If using Kubernetes: count of allocated devices that are actively used for a workload.                                                                  | `gpu.gr_engine_active`                             |
| Provisioning    | Effective Devices        | Count of devices that are used and working for more than 50% of the selected time frame.                                                                                                                                    | `gpu.sm_active`                                    |
| Performance     | CPU Utilization          | % of time the CPU spent running user space processes.                                                                                                                                                                       | `system.cpu.user`                                  |
| Performance | Host Uptime | Time since the host was last started. | `system.uptime` |
| Performance | Host I/O Utilization | % of CPU time during which I/O requests were issued to the device. | `system.io.util` |
| Performance     | Host Memory              | % of usable memory in use.                                                                                                                                                                                                  | `system.mem.pct_usable`                            |
| Performance     | GPU Utilization          | Average % of time each streaming multiprocessor was active (lower values indicate idle time).                                                                                                                               | `gpu.sm_active`                                    |
| Performance     | GPU Memory               | Ratio (%) of GPU memory used to total GPU memory limit.                                                                                                                                                                     | `100 - (gpu.memory.free / gpu.memory.limit * 100)` |
| Performance | Power | Power usage for the GPU device. **Note**: On GA100 and previous architectures, this represents the instantaneous power draw at that moment. For newer architectures, it represents the average power draw (Watts) over one second. | `gpu.power.usage` |
| Performance     | Temperature              | Temperature of a GPU device.                                                                                                                                                                                                | `gpu.temperature`                                  |
| Performance     | SM Clock                 | SM clock frequency in MHz.                                                                                                                                                                                                  | `gpu.clock_speed.sm`                               |
| Performance     | PCIe RX Throughput       | Bytes received through PCI from the GPU device per second.                                                                                                                                                                  | `gpu.pci.throughput.rx`                            |
| Performance     | PCIe TX Throughput       | Bytes transmitted through PCI to the GPU device per second.                                                                                                                                                                 | `gpu.pci.throughput.tx`                            |
| Performance     | NVLink RX                | Total RX of all NVLINK links.                                                                                                                                                                                               | `gpu.nvlink.throughput.raw.rx`                     |
| Performance     | NVLink TX                | Total TX of all NVLINK links.                                                                                                                                                                                               | `gpu.nvlink.throughput.raw.tx`                     |
| Performance     | NVLink Active Links      | Number of active NVLINK links for the device.                                                                                                                                                                               | `gpu.nvlink.count.active`                          |

{% /collapsible-section %}
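
Datadog computes the Idle Cost column for you, so you don't need to reproduce its methodology. Purely as a rough mental model (hypothetical prices and utilization, not Datadog's actual calculation), an idle-cost figure represents the share of paid-for GPU hours that went unused:

```python
# Back-of-envelope mental model for Idle Cost (hypothetical numbers only; this
# is not Datadog's actual calculation): the idle share of paid-for GPU hours.
hourly_price_usd = 2.50      # hypothetical on-demand price per GPU per hour
provisioned_devices = 8
window_hours = 72            # Idle Cost is only nonzero for windows > 2 days
avg_idle_fraction = 0.40     # e.g. 100 - gpu.gr_engine_active, as a ratio

idle_cost_usd = provisioned_devices * window_hours * avg_idle_fraction * hourly_price_usd
print(f"Approximate idle cost over the window: ${idle_cost_usd:,.2f}")
```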

## Details side panel{% #details-side-panel %}

Clicking any row in the Fleet table opens a side panel with more details for the selected cluster, host, or device.

### Connected Entities{% #connected-entities %}

Datadog's GPU Monitoring doesn't need to rely on NVIDIA's DCGM Exporter. It uses the Datadog Agent to observe GPUs directly, providing insight into GPU usage and costs for pods and processes. Under the **Connected Entities** section in any detail view, you can see SM activity, GPU core utilization (only if System Probe is enabled), and the memory usage of pods, processes, and Slurm jobs. This helps you identify which workloads to cut or optimize to decrease total spend.

**Note**: The **Pods** tab is only available if you're using Kubernetes.

{% tab title="Cluster side panel" %}
Within this side panel, you have a cluster-specific funnel that identifies:

- Number of Total, Allocated (Kubernetes users only), Active, and Effective devices within that particular cluster

- Estimated total and idle cost of that cluster

- Connected entities of that cluster: pods, processes, and Slurm jobs

- Four key metrics (customizable) for that cluster: Core Utilization (only if System Probe is enabled), Memory Utilization, PCIe Throughput, and Graphics Activity

- Table of hosts associated with that cluster

  {% image
     source="https://docs.dd-static.net/images/gpu_monitoring/cluster_sidepanel.880f2caeb503225de4a517d04227b39f.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/cluster_sidepanel.880f2caeb503225de4a517d04227b39f.png?auto=format&fit=max&w=850&dpr=2 2x"
     alt="Cluster specific side panel that breaks down idle devices, costs and connected entities" /%}

{% /tab %}

{% tab title="Host side panel" %}
Within this side panel, you have a host-specific view that identifies:

- Host-related metadata such as provider, instance type, CPU utilization, system memory used, system memory total, system IO util, SM activity, and temperature

- (Only available for Kubernetes users) The specific GPU devices allocated to that host, sorted by Graphics Engine Activity

- Connected Entities of that host: pods, processes, and Slurm jobs

  {% image
     source="https://docs.dd-static.net/images/gpu_monitoring/host_sidepanel.d4d52b29cb0427c321e8b749604c2df0.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/host_sidepanel.d4d52b29cb0427c321e8b749604c2df0.png?auto=format&fit=max&w=850&dpr=2 2x"
     alt="Host specific side panel that displays the GPU devices tied to that host and Connected Entities" /%}

{% /tab %}

{% tab title="Device side panel" %}
Within this side panel, you have a device-specific view that identifies:

- Recommendations (if any) for how to use this device more effectively

- Device-related details: device type, SM activity, and temperature

- Four key metrics tied to GPUs: SM Activity, Memory Utilization, Power, and Graphics Engine Activity

- Connected Entities of that device: pods and processes

  {% image
     source="https://docs.dd-static.net/images/gpu_monitoring/device_sidepanel.1fdda7ff5d6507f7232eceaca8d20be2.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/device_sidepanel.1fdda7ff5d6507f7232eceaca8d20be2.png?auto=format&fit=max&w=850&dpr=2 2x"
     alt="Device specific side panel that displays recommendations for how to use the device more effectively and other key telemetry." /%}

{% /tab %}

## Installation recommendations{% #installation-recommendations %}

Datadog actively surveys your infrastructure and detects installation gaps that may diminish the value you get from GPU Monitoring. The installation recommendations modal suggests updates that help you get the most out of GPU Monitoring, such as installing the [latest version](https://github.com/DataDog/datadog-agent/releases) of the Datadog Agent on your hosts, installing the latest version of the NVIDIA driver, and fixing misconfigured hosts.

To view advanced GPU Monitoring features such as attribution of GPU resource usage to related processes or Slurm jobs, you must enable [Live Processes](https://docs.datadoghq.com/infrastructure/process.md) and the [Slurm](https://docs.datadoghq.com/integrations/slurm.md) integration, respectively.
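
For example, to check what the current Agent release is before comparing it against the versions deployed on your GPU hosts, you can query the public GitHub releases feed. This helper is hypothetical and not part of Datadog's tooling; it only assumes Python's standard library and the public GitHub API.

```python
# Hypothetical helper (not part of Datadog's tooling): fetch the latest Datadog
# Agent release tag from GitHub to compare against the Agents in your fleet.
import json
import urllib.request

url = "https://api.github.com/repos/DataDog/datadog-agent/releases/latest"
with urllib.request.urlopen(url) as resp:
    latest_tag = json.load(resp)["tag_name"]

print(f"Latest Datadog Agent release: {latest_tag}")
```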

{% image
   source="https://docs.dd-static.net/images/gpu_monitoring/installation.d39249d01b9f30bce8947c4d29a9736f.png?auto=format&fit=max&w=850 1x, https://docs.dd-static.net/images/gpu_monitoring/installation.d39249d01b9f30bce8947c4d29a9736f.png?auto=format&fit=max&w=850&dpr=2 2x"
   alt="Modal containing installation guidance for smoother GPU Monitoring user experience." /%}

## Further reading{% #further-reading %}

- [Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring](https://www.datadoghq.com/blog/datadog-gpu-monitoring/)
