---
title: Nvidia NVML
description: Support Nvidia GPU metrics in k8s
breadcrumbs: Docs > Integrations > Nvidia NVML
---

# Nvidia NVML
**Supported OS**: Linux, Windows · **Integration version**: 1.0.9
## Overview{% #overview %}

This check monitors [NVIDIA Management Library (NVML)](https://pypi.org/project/pynvml/) exposed metrics through the Datadog Agent and can correlate them with the [exposed Kubernetes devices](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources).

## Setup{% #setup %}

The NVML check is not included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package, so you need to install it.

### Installation{% #installation %}

For Agent v7.21+ / v6.21+, follow the instructions below to install the NVML check on your host. See [Use Community Integrations](https://docs.datadoghq.com/agent/guide/use-community-integrations/) to install with the Docker Agent or earlier versions of the Agent.

1. Run the following command to install the Agent integration:

For Linux:

   ```shell
   datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
   # You may also need to install dependencies since those aren't packaged into the wheel
   sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml==11.5.3
   ```

For Windows (Using Powershell run as admin):

   ```shell
   & "$env:ProgramFiles\Datadog\Datadog Agent\bin\agent.exe" integration install -t datadog-nvml==<INTEGRATION_VERSION>
   # You may also need to install dependencies since those aren't packaged into the wheel
   & "$env:ProgramFiles\Datadog\Datadog Agent\embedded3\python" -m pip install grpcio pynvml==11.5.3
   ```

1. Configure your integration the same way as core [integrations](https://docs.datadoghq.com/getting_started/integrations/).

If you are using Docker, there is an [example Dockerfile](https://github.com/DataDog/integrations-extras/blob/master/nvml/tests/Dockerfile) in the NVML repository.

```shell
docker build -t dd-agent-nvml .
```

If you're using Docker and Kubernetes, you need to expose the environment variables `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES`. See the included Dockerfile for an example.
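As a sketch, these variables could be set in the Agent container's spec as follows. The container name, image tag, and values shown here are illustrative; `all` and `utility` are common values for the NVIDIA container toolkit, but adjust them to your environment:

```yaml
# Illustrative Agent container spec; name, image, and values are placeholders.
spec:
  containers:
    - name: dd-agent-nvml
      image: dd-agent-nvml:latest   # for example, built from the example Dockerfile
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "utility"
```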

To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pods that use them, mount the Unix domain socket `/var/lib/kubelet/pod-resources/kubelet.sock` into your Agent container. More information about this socket is available on the [Kubernetes website](https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#monitoring-device-plugin-resources). **Note**: This socket's API is in beta support as of Kubernetes 1.15.
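A minimal sketch of the corresponding volume mount in the Agent pod's spec, assuming the default kubelet pod-resources path; the container and volume names are placeholders:

```yaml
# Illustrative hostPath mount for the kubelet pod-resources socket.
spec:
  containers:
    - name: dd-agent-nvml
      volumeMounts:
        - name: pod-resources
          mountPath: /var/lib/kubelet/pod-resources
  volumes:
    - name: pod-resources
      hostPath:
        path: /var/lib/kubelet/pod-resources
```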

### Configuration{% #configuration %}

1. Edit the `nvml.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your NVML performance data. See the [sample nvml.d/conf.yaml](https://github.com/DataDog/integrations-extras/blob/master/nvml/datadog_checks/nvml/data/conf.yaml.example) for all available configuration options.

1. [Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent).
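As a sketch, a minimal `nvml.d/conf.yaml` might look like the following. `min_collection_interval` is a standard option available to all Agent checks, not specific to NVML; see the sample conf.yaml linked above for the check's own options:

```yaml
init_config:

instances:
    # Collect at the default interval; adjust as needed.
  - min_collection_interval: 15
```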

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information) and look for `nvml` under the Checks section.

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **nvml.device\_count**(gauge)              | Number of GPUs on this instance.                                                                                             |
| **nvml.gpu\_utilization**(gauge)           | Percent of time over the past sample period during which one or more kernels was executing on the GPU.*Shown as percent*     |
| **nvml.mem\_copy\_utilization**(gauge)     | Percent of time over the past sample period during which global (device) memory was being read or written.*Shown as percent* |
| **nvml.fb\_free**(gauge)                   | Unallocated FB memory.*Shown as byte*                                                                                        |
| **nvml.fb\_used**(gauge)                   | Allocated FB memory.*Shown as byte*                                                                                          |
| **nvml.fb\_total**(gauge)                  | Total installed FB memory.*Shown as byte*                                                                                    |
| **nvml.power\_usage**(gauge)               | Power usage for this GPU and its associated circuitry (for example, memory), in milliwatts                                   |
| **nvml.total\_energy\_consumption**(count) | Total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded                                 |
| **nvml.enc\_utilization**(gauge)           | The current utilization for the Encoder*Shown as percent*                                                                    |
| **nvml.dec\_utilization**(gauge)           | The current utilization for the Decoder*Shown as percent*                                                                    |
| **nvml.pcie\_tx\_throughput**(gauge)       | PCIe TX utilization*Shown as kibibyte*                                                                                       |
| **nvml.pcie\_rx\_throughput**(gauge)       | PCIe RX utilization*Shown as kibibyte*                                                                                       |
| **nvml.temperature**(gauge)                | Current temperature for this GPU in degrees Celsius                                                                          |
| **nvml.fan\_speed**(gauge)                 | The current utilization for the fan*Shown as percent*                                                                        |
| **nvml.compute\_running\_process**(gauge)  | Current GPU memory usage per running compute process*Shown as byte*                                                          |

Where possible, metric names are matched to those used by NVIDIA's [Data Center GPU Manager (DCGM) exporter](https://github.com/NVIDIA/dcgm-exporter).

### Events{% #events %}

NVML does not include any events.

### Service Checks{% #service-checks %}

See [service_checks.json](https://github.com/DataDog/integrations-extras/blob/master/nvml/assets/service_checks.json) for a list of service checks provided by this integration.

## Troubleshooting{% #troubleshooting %}

Need help? Contact [Datadog support](https://docs.datadoghq.com/help).
