This check monitors metrics exposed by the NVIDIA Management Library (NVML) through the Datadog Agent and can correlate them with exposed Kubernetes devices.
The NVML check is not included in the Datadog Agent package, so you need to install it.
For Agent v7.21+ / v6.21+, follow the instructions below to install the NVML check on your host. See Use Community Integrations to install with the Docker Agent or earlier versions of the Agent.
Run the following command to install the Agent integration:
For Linux:
datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml
For Windows (using PowerShell run as admin):
& "$env:ProgramFiles\Datadog\Datadog Agent\bin\agent.exe" integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
& "$env:ProgramFiles\Datadog\Datadog Agent\embedded3\python" -m pip install grpcio pynvml
Configure your integration similarly to core integrations.
If you are using Docker, there is an example Dockerfile in the NVML repository.
docker build -t dd-agent-nvml .
If you're using Docker and Kubernetes, you need to expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.
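As a minimal sketch, these variables can be passed when running the image built above; the flag values shown are illustrative assumptions and require the NVIDIA container runtime to be installed:
# Illustrative only: expose all GPUs and the utility driver capability to the Agent container
docker run --runtime=nvidia -e DD_API_KEY=<YOUR_API_KEY> -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=utility dd-agent-nvml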
To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using the device, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent's container. More information about this socket is available on the Kubernetes website. Note: This socket is in beta support as of Kubernetes version 1.15.
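As a sketch, when running the Agent container directly, the socket's parent directory can be mounted with a standard Docker volume flag; the host path shown is the default kubelet pod-resources location:
# Illustrative only: mount the kubelet pod-resources socket so the check can map GPUs to pods
docker run --runtime=nvidia -v /var/lib/kubelet/pod-resources:/var/lib/kubelet/pod-resources dd-agent-nvml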
Edit the nvml.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your NVML performance data. See the sample nvml.d/conf.yaml for all available configuration options.
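As a minimal sketch, the check typically needs only an empty instance entry; anything beyond this is an assumption best confirmed against the sample file:
init_config:

instances:
  - {}
After saving the file, restart the Agent for the configuration to take effect.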
Run the Agent's status subcommand and look for nvml under the Checks section.
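For example, on Linux:
# The output lists nvml under the Checks section when the integration is running
sudo datadog-agent status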
nvml.device_count (gauge) | Number of GPUs on this instance. |
nvml.gpu_utilization (gauge) | Percent of time over the past sample period during which one or more kernels was executing on the GPU. Shown as percent |
nvml.mem_copy_utilization (gauge) | Percent of time over the past sample period during which global (device) memory was being read or written. Shown as percent |
nvml.fb_free (gauge) | Unallocated framebuffer (FB) memory. Shown as byte |
nvml.fb_used (gauge) | Allocated FB memory. Shown as byte |
nvml.fb_total (gauge) | Total installed FB memory. Shown as byte |
nvml.power_usage (gauge) | Power usage in milliwatts for this GPU and its associated circuitry (for example, memory). |
nvml.total_energy_consumption (count) | Total energy consumption in millijoules (mJ) for this GPU since the driver was last reloaded. |
nvml.enc_utilization (gauge) | The current utilization of the encoder. Shown as percent |
nvml.dec_utilization (gauge) | The current utilization of the decoder. Shown as percent |
nvml.pcie_tx_throughput (gauge) | PCIe TX throughput. Shown as kibibyte |
nvml.pcie_rx_throughput (gauge) | PCIe RX throughput. Shown as kibibyte |
nvml.temperature (gauge) | Current temperature of this GPU in degrees Celsius. |
nvml.fan_speed (gauge) | The current fan speed. Shown as percent |
nvml.compute_running_process (gauge) | The current usage of GPU memory by process. Shown as byte |
Where possible, metric names are matched to those of NVIDIA's Data Center GPU Manager (DCGM) exporter.
The NVML check does not include any events.
Need help? Contact Datadog support.