Nvidia NVML

Supported OS: Linux, Windows, Mac OS

Integration version: 1.0.9

Overview

This check monitors NVIDIA Management Library (NVML) metrics exposed through the Datadog Agent and can correlate them with the exposed Kubernetes devices.

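Setup

The NVML check is not included in the Datadog Agent package, so you need to install it.

Installation

For Agent v7.21+ / v6.21+, follow the instructions below to install the NVML check on your host. To install it with the Docker Agent or with earlier versions of the Agent, see Use Community Integrations.

Run the following command to install the Agent integration.

For Linux: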

datadog-agent integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
sudo -u dd-agent -H /opt/datadog-agent/embedded/bin/pip3 install grpcio pynvml

For Windows (using PowerShell run as administrator):

& "$env:ProgramFiles\Datadog\Datadog Agent\bin\agent.exe" integration install -t datadog-nvml==<INTEGRATION_VERSION>
# You may also need to install dependencies since those aren't packaged into the wheel
& "$env:ProgramFiles\Datadog\Datadog Agent\embedded3\python" -m pip install grpcio pynvml

Configure your integration the same way as a core integration.
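The dependency installation step above can be sanity-checked with a short script. The following is a minimal sketch, not part of the integration: it assumes it is run with the Agent's embedded Python interpreter (for example a `python3` binary next to the `pip3` used above, which is an assumption), and the helper name `missing_modules` is hypothetical. Note that the `grpcio` package is imported as `grpc`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # grpcio installs the `grpc` module; pynvml installs the `pynvml` module.
    missing = missing_modules(["grpc", "pynvml"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All NVML check dependencies are importable.")
```

If a module is reported missing, rerun the `pip` command above with the same interpreter the Agent uses.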

If you are using Docker, there is an example Dockerfile in the NVML repository.

docker build -t dd-agent-nvml .

If you are using Docker and Kubernetes, you need to expose the environment variables NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES. See the included Dockerfile for an example.
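To confirm that both variables actually reach the Agent container, a small check like the one below can be run inside it. This is only a sketch: the function name `unset_nvidia_vars` is hypothetical, and the script checks only that the variables are set and non-empty, not that their values are valid.

```python
import os

# The two variables named in the text above.
REQUIRED_VARS = ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES")


def unset_nvidia_vars(environ=None):
    """Return the required variable names that are missing or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = unset_nvidia_vars()
    if missing:
        print("Not set in this container:", ", ".join(missing))
    else:
        print("Both NVIDIA environment variables are set.")
```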

To correlate reserved Kubernetes NVIDIA devices with the Kubernetes pod using the device, mount the Unix domain socket /var/lib/kubelet/pod-resources/kubelet.sock into your Agent's configuration. More information about this socket is available on the Kubernetes website. Note: This device is in beta support for version 1.15.
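Whether the socket was mounted correctly can be checked from inside the Agent container. The sketch below assumes the socket is mounted at the same path as on the host; the helper name `is_unix_socket` is hypothetical. It verifies only that the path exists and is a Unix domain socket, not that the kubelet answers on it.

```python
import os
import stat

# Path taken from the text above; the mount point inside your container may differ.
POD_RESOURCES_SOCKET = "/var/lib/kubelet/pod-resources/kubelet.sock"


def is_unix_socket(path):
    """Return True if `path` exists and is a Unix domain socket."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing path or insufficient permissions.
        return False
    return stat.S_ISSOCK(mode)


if __name__ == "__main__":
    if is_unix_socket(POD_RESOURCES_SOCKET):
        print("Pod resources socket is mounted.")
    else:
        print("Pod resources socket not found at", POD_RESOURCES_SOCKET)
```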

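Configuration

Edit the nvml.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory to start collecting your NVML performance data. See the sample nvml.d/conf.yaml for all available configuration options.
Restart the Agent.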

Validation

Run the Agent's status subcommand and look for nvml under the Checks section.
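The lookup can also be scripted. The sketch below assumes that the `datadog-agent` binary is on the PATH, that the current user is allowed to run it, and that each check appears in the status output on a line beginning with its name. That output layout is an assumption and may vary between Agent versions. The function names are hypothetical.

```python
import subprocess


def check_listed(status_text, check_name="nvml"):
    """Return True if any line of the status output begins with the check name."""
    for line in status_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] == check_name:
            return True
    return False


def agent_status_text():
    """Run `datadog-agent status`; return its stdout, or None if it cannot be run."""
    try:
        result = subprocess.run(
            ["datadog-agent", "status"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # Binary not found or not executable by this user.
        return None
    return result.stdout


if __name__ == "__main__":
    text = agent_status_text()
    if text is None:
        print("Could not run datadog-agent status.")
    else:
        print("nvml listed:", check_listed(text))
```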

Data Collected

Metrics


nvml.device_count (gauge)	Number of GPUs on this instance.
nvml.gpu_utilization (gauge)	Percent of time over the past sample period during which one or more kernels was executing on the GPU. Shown as percent.
nvml.mem_copy_utilization (gauge)	Percent of time over the past sample period during which global (device) memory was being read or written. Shown as percent.
nvml.fb_free (gauge)	Unallocated FB memory. Shown as byte.
nvml.fb_used (gauge)	Allocated FB memory. Shown as byte.
nvml.fb_total (gauge)	Total installed FB memory. Shown as byte.
nvml.power_usage (gauge)	Power usage for this GPU and its associated circuitry (e.g. memory), in milliwatts.
nvml.total_energy_consumption (count)	Total energy consumption for this GPU in millijoules (mJ) since the driver was last reloaded.
nvml.enc_utilization (gauge)	The current utilization for the encoder. Shown as percent.
nvml.dec_utilization (gauge)	The current utilization for the decoder. Shown as percent.
nvml.pcie_tx_throughput (gauge)	PCIe TX utilization. Shown as kibibyte.
nvml.pcie_rx_throughput (gauge)	PCIe RX utilization. Shown as kibibyte.
nvml.temperature (gauge)	Current temperature for this GPU in degrees Celsius.
nvml.fan_speed (gauge)	The current utilization for the fan. Shown as percent.
nvml.compute_running_process (gauge)	The current usage of GPU memory by process. Shown as byte.
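Several of these metrics use raw units (bytes, milliwatts, millijoules) that are often converted before being put on a dashboard. The following sketch shows the arithmetic only; the function names are hypothetical and are not part of the integration.

```python
def fb_used_percent(fb_used, fb_total):
    """Frame buffer usage as a percentage, from nvml.fb_used and nvml.fb_total (bytes)."""
    if fb_total <= 0:
        return 0.0
    return 100.0 * fb_used / fb_total


def milliwatts_to_watts(milliwatts):
    """Convert nvml.power_usage (mW) to watts."""
    return milliwatts / 1000.0


def millijoules_to_kwh(millijoules):
    """Convert nvml.total_energy_consumption (mJ) to kilowatt-hours.

    1 kWh = 3.6e6 J = 3.6e9 mJ.
    """
    joules = millijoules / 1000.0
    return joules / 3_600_000.0


if __name__ == "__main__":
    # Example values chosen for illustration only.
    print(fb_used_percent(4 * 1024**3, 16 * 1024**3))
    print(milliwatts_to_watts(250_000))
    print(millijoules_to_kwh(3.6e9))
```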

Authoritative metric documentation is available on the NVIDIA website.
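For comparing reported values against the library directly, the `pynvml` bindings installed earlier can be queried by hand. The sketch below is an illustration, not the check's implementation: the mapping between NVML calls and the metric names above is an assumption, and the function returns None when `pynvml` or an NVIDIA driver is unavailable.

```python
try:
    import pynvml
except ImportError:  # pynvml is not installed in this interpreter
    pynvml = None


def read_gpu_snapshot():
    """Return one dict of raw NVML readings per GPU, or None if NVML is unavailable."""
    if pynvml is None:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No driver or no NVML library on this host.
        return None
    try:
        snapshot = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            utilization = pynvml.nvmlDeviceGetUtilizationRates(handle)
            memory = pynvml.nvmlDeviceGetMemoryInfo(handle)
            snapshot.append({
                "gpu_utilization": utilization.gpu,          # percent
                "mem_copy_utilization": utilization.memory,  # percent
                "fb_free": memory.free,                      # bytes
                "fb_used": memory.used,                      # bytes
                "fb_total": memory.total,                    # bytes
                "temperature": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU
                ),                                           # degrees Celsius
            })
        return snapshot
    except pynvml.NVMLError:
        # A query is not supported on this device or driver.
        return None
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print(read_gpu_snapshot())
```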

Where possible, metric names are chosen to match those of NVIDIA's Data Center GPU Manager (DCGM) exporter.

Events

NVML does not include any events.

Service Checks

See service_checks.json for a list of service checks provided by this integration.

Troubleshooting

Need help? Contact Datadog support.