Nvidia NVML

Agent Check

Supported OS: Linux, Mac OS, Windows

Overview

This check monitors metrics exposed by the NVIDIA Management Library (NVML) through the Datadog Agent and can correlate them with the exposed Kubernetes devices.

Setup

The NVML check is not included in the Datadog Agent package, so you need to install it.

Installation

For Agent v7.21+ / v6.21+, follow the instructions below to install the NVML check on your host. See Use Community Integrations to install with the Docker Agent or earlier versions of the Agent.

  1. Run the following command to install the Agent integration:

    datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
    # You may also need to install dependencies since those aren't packaged into the wheel
    sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml
    
  2. Configure the integration the same way you would a core integration.

If you are using Docker, there is an example Dockerfile in the NVML repository.

    docker build --build-arg=DD_AGENT_VERSION=7.18.0 .

If you’re using Docker and Kubernetes, you need to expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.
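As a quick sanity check from inside the Agent container, the following minimal Python sketch reports which of the two variables named above are unset or empty. The function name `missing_nvidia_env` is hypothetical and introduced only for illustration; the sketch does not validate the values themselves.

```python
import os

# The two environment variables named in the text above.
REQUIRED_VARS = ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES")


def missing_nvidia_env(environ=None):
    """Return the required NVIDIA variables that are unset or empty.

    `environ` defaults to the current process environment; pass a dict
    to check some other mapping (hypothetical helper, for illustration).
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_nvidia_env()` with no arguments inspects the current process environment; an empty list means both variables are present.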

If you want to correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using the device, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent’s configuration. More information about this socket is on the Kubernetes website. Note: Support for this socket is in beta in Kubernetes version 1.15.
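One way to confirm that the mount worked is to check, from inside the Agent container, that the path exists and is a Unix domain socket. The sketch below assumes the socket is mounted at the same path quoted above; `is_unix_socket` is a hypothetical helper name, and the check says nothing about whether the kubelet is answering on the socket.

```python
import os
import stat

# Socket path quoted in the text above; adjust if you mount it elsewhere.
KUBELET_SOCKET = "/var/lib/kubelet/pod-resources/kubelet.sock"


def is_unix_socket(path=KUBELET_SOCKET):
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path, permission problem, and so on.
        return False
    return stat.S_ISSOCK(mode)
```

If `is_unix_socket()` returns False, the volume mount is probably missing or points at a different path.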

Configuration

  1. Edit the nvml.d/conf.yaml file, in the conf.d/ folder at the root of your Agent’s configuration directory, to start collecting your NVML performance data. See the sample nvml.d/conf.yaml for all available configuration options.

  2. Restart the Agent.

Validation

Run the Agent’s status subcommand and look for nvml under the Checks section.
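If you want to script this validation step, a minimal Python sketch follows. It assumes the `datadog-agent` binary is on the PATH and that the check name appears as the first word on some line of the status output; the exact layout of that output may differ between Agent versions. Both function names are hypothetical.

```python
import subprocess


def check_is_listed(status_output, check_name="nvml"):
    """Return True if any line of the status text starts with the check name."""
    for line in status_output.splitlines():
        words = line.split()
        if words and words[0] == check_name:
            return True
    return False


def agent_status_output():
    """Run `datadog-agent status` and return its standard output as text."""
    result = subprocess.run(
        ["datadog-agent", "status"],
        capture_output=True,
        text=True,
        check=False,  # inspect the output even if the Agent exits non-zero
    )
    return result.stdout
```

For example, `check_is_listed(agent_status_output())` returns True when a line beginning with `nvml` is present. Depending on your installation, the status command may need to be run with elevated privileges.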

Data Collected

Metrics

nvml.device_count (gauge)
Number of GPUs on this instance.

nvml.gpu_utilization (gauge)
Percent of time over the past sample period during which one or more kernels were executing on the GPU.
Shown as percent

nvml.mem_copy_utilization (gauge)
Percent of time over the past sample period during which global (device) memory was being read or written.
Shown as percent

nvml.fb_free (gauge)
Unallocated frame buffer (FB) memory.
Shown as byte

nvml.fb_used (gauge)
Allocated frame buffer (FB) memory.
Shown as byte

nvml.fb_total (gauge)
Total installed frame buffer (FB) memory.
Shown as byte

nvml.power_usage (gauge)
Power usage, in milliwatts, for this GPU and its associated circuitry (for example, memory).

nvml.total_energy_consumption (count)
Total energy consumption for this GPU, in millijoules (mJ), since the driver was last reloaded.

nvml.enc_utilization (gauge)
The current utilization of the encoder.
Shown as percent

nvml.dec_utilization (gauge)
The current utilization of the decoder.
Shown as percent

nvml.pcie_tx_throughput (gauge)
PCIe TX throughput.
Shown as kibibyte

nvml.pcie_rx_throughput (gauge)
PCIe RX throughput.
Shown as kibibyte

The authoritative metric documentation is on the NVIDIA website.

Where possible, metric names match those used by NVIDIA’s Data Center GPU Manager (DCGM) exporter.
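To illustrate where values like these come from, the sketch below reads each quantity through the `pynvml` bindings installed earlier. The mapping from metric names to NVML calls is an assumption inferred from the metric descriptions, not taken from the check's source, and `collect_gpu_metrics` is a hypothetical name. The NVML binding is passed in as an argument so the function can be exercised without a GPU.

```python
def collect_gpu_metrics(nvml):
    """Return the device count and one dict of metric values per GPU.

    `nvml` is the imported `pynvml` module (or any object exposing the same
    functions); nvmlInit() must already have been called on it.
    The metric-to-call mapping is an assumption made for illustration.
    """
    count = nvml.nvmlDeviceGetCount()
    devices = []
    for index in range(count):
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        util = nvml.nvmlDeviceGetUtilizationRates(handle)  # percent fields
        mem = nvml.nvmlDeviceGetMemoryInfo(handle)         # byte fields
        # Encoder/decoder queries return [utilization, sampling_period_us].
        enc_util, _ = nvml.nvmlDeviceGetEncoderUtilization(handle)
        dec_util, _ = nvml.nvmlDeviceGetDecoderUtilization(handle)
        devices.append({
            "nvml.gpu_utilization": util.gpu,
            "nvml.mem_copy_utilization": util.memory,
            "nvml.fb_free": mem.free,
            "nvml.fb_used": mem.used,
            "nvml.fb_total": mem.total,
            # Milliwatts.
            "nvml.power_usage": nvml.nvmlDeviceGetPowerUsage(handle),
            # Millijoules since the driver was last reloaded.
            "nvml.total_energy_consumption":
                nvml.nvmlDeviceGetTotalEnergyConsumption(handle),
            "nvml.enc_utilization": enc_util,
            "nvml.dec_utilization": dec_util,
            "nvml.pcie_tx_throughput": nvml.nvmlDeviceGetPcieThroughput(
                handle, nvml.NVML_PCIE_UTIL_TX_BYTES),
            "nvml.pcie_rx_throughput": nvml.nvmlDeviceGetPcieThroughput(
                handle, nvml.NVML_PCIE_UTIL_RX_BYTES),
        })
    return {"nvml.device_count": count, "devices": devices}
```

On a host with a GPU and `pynvml` installed, you would import `pynvml`, call `pynvml.nvmlInit()`, pass the module to `collect_gpu_metrics(pynvml)`, and finish with `pynvml.nvmlShutdown()`. Some queries, such as total energy consumption, are not supported on every GPU and may raise an NVML error there; the sketch does not handle that case.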

Events

NVML does not include any events.

Service Checks

Troubleshooting

Need help? Contact Datadog support.