This check monitors NVIDIA NIM through the Datadog Agent.
Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
Requirements: The NVIDIA NIM check is included in the Datadog Agent package, so no additional installation is needed on your server.
NVIDIA NIM provides Prometheus metrics indicating request statistics. By default, these metrics are available at http://localhost:8000/metrics. The Datadog Agent can collect the exposed metrics using this integration. Follow the instructions below to configure data collection from any or all of the components.
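To see what the endpoint exposes, you can fetch and parse its Prometheus text format directly. The sketch below parses a small, made-up payload (the metric names and labels are illustrative, not actual NIM output) so the structure of the exposition format is clear:

```python
# Minimal sketch: parse Prometheus text-format metrics like those NVIDIA NIM
# exposes at http://localhost:8000/metrics. The sample payload below is
# illustrative only; real NIM output differs.

def parse_prometheus_text(payload):
    """Return {metric_name: {frozenset(labels): value}} from exposition text."""
    metrics = {}
    for line in payload.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, raw_labels = name_part.split("{", 1)
            labels = frozenset(raw_labels.rstrip("}").split(","))
        else:
            name, labels = name_part, frozenset()
        metrics.setdefault(name, {})[labels] = float(value)
    return metrics

sample = """\
# HELP num_requests_running Number of requests currently running on GPU.
# TYPE num_requests_running gauge
num_requests_running{model_name="llama-3"} 4
gpu_cache_usage_perc{model_name="llama-3"} 0.25
"""

parsed = parse_prometheus_text(sample)
print(parsed["num_requests_running"])
```

In practice the Agent's OpenMetrics support does this scraping for you; the sketch only shows the wire format the integration consumes.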
To start collecting your NVIDIA NIM performance data, edit the nvidia_nim.d/conf.yaml file in the conf.d/ folder at the root of your Agent's configuration directory. See the sample nvidia_nim.d/conf.yaml for all available configuration options.
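A minimal instance configuration might look like the following sketch. The `openmetrics_endpoint` option is the convention for OpenMetrics-based Agent checks, but verify the option names against the bundled sample file:

```yaml
## Hedged sketch of nvidia_nim.d/conf.yaml -- confirm option names
## against the sample conf.yaml shipped with the Agent.
init_config:

instances:
    ## Metrics endpoint exposed by NVIDIA NIM (default port 8000).
  - openmetrics_endpoint: http://localhost:8000/metrics
```

After editing the file, restart the Agent so the new configuration takes effect.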
Run the Agent's status subcommand and look for nvidia_nim under the Checks section.
| Metric | Type | Description |
| --- | --- | --- |
| nvidia_nim.e2e_request_latency.seconds.bucket | count | Observations of end-to-end request latency, bucketed by seconds. |
| nvidia_nim.e2e_request_latency.seconds.count | count | Total number of observations of end-to-end request latency. |
| nvidia_nim.e2e_request_latency.seconds.sum | count | Sum of end-to-end request latency in seconds. Shown as second. |
| nvidia_nim.generation_tokens.count | count | Number of generation tokens processed. Shown as token. |
| nvidia_nim.gpu_cache_usage_percent | gauge | GPU KV-cache usage; 1 means 100 percent usage. Shown as fraction. |
| nvidia_nim.num_request.max | gauge | Maximum number of concurrently running requests. Shown as request. |
| nvidia_nim.num_requests.running | gauge | Number of requests currently running on GPU. Shown as request. |
| nvidia_nim.num_requests.waiting | gauge | Number of requests waiting. Shown as request. |
| nvidia_nim.process.cpu_seconds.count | count | Total user and system CPU time spent, in seconds. Shown as second. |
| nvidia_nim.process.max_fds | gauge | Maximum number of open file descriptors. Shown as file. |
| nvidia_nim.process.open_fds | gauge | Number of open file descriptors. Shown as file. |
| nvidia_nim.process.resident_memory_bytes | gauge | Resident memory size in bytes. Shown as byte. |
| nvidia_nim.process.start_time_seconds | gauge | Time in seconds since the process started. Shown as second. |
| nvidia_nim.process.virtual_memory_bytes | gauge | Virtual memory size in bytes. Shown as byte. |
| nvidia_nim.prompt_tokens.count | count | Number of prefill tokens processed. Shown as token. |
| nvidia_nim.python.gc.collections.count | count | Number of times this generation was collected. |
| nvidia_nim.python.gc.objects.collected.count | count | Objects collected during GC. |
| nvidia_nim.python.gc.objects.uncollectable.count | count | Uncollectable objects found during GC. |
| nvidia_nim.python.info | gauge | Python platform information. |
| nvidia_nim.request.failure.count | count | Count of failed requests. Shown as request. |
| nvidia_nim.request.finish.count | count | Count of finished requests. Shown as request. |
| nvidia_nim.request.generation_tokens.bucket | count | Number of generation tokens processed. |
| nvidia_nim.request.generation_tokens.count | count | Number of generation tokens processed. |
| nvidia_nim.request.generation_tokens.sum | count | Number of generation tokens processed. Shown as token. |
| nvidia_nim.request.prompt_tokens.bucket | count | Number of prefill tokens processed. |
| nvidia_nim.request.prompt_tokens.count | count | Number of prefill tokens processed. |
| nvidia_nim.request.prompt_tokens.sum | count | Number of prefill tokens processed. Shown as token. |
| nvidia_nim.request.success.count | count | Count of successfully processed requests. |
| nvidia_nim.time_per_output_token.seconds.bucket | count | Observations of time per output token, bucketed by seconds. |
| nvidia_nim.time_per_output_token.seconds.count | count | Total number of observations of time per output token. |
| nvidia_nim.time_per_output_token.seconds.sum | count | Sum of time per output token in seconds. Shown as second. |
| nvidia_nim.time_to_first_token.seconds.bucket | count | Observations of time to first token, bucketed by seconds. |
| nvidia_nim.time_to_first_token.seconds.count | count | Total number of observations of time to first token. |
| nvidia_nim.time_to_first_token.seconds.sum | count | Sum of time to first token in seconds. Shown as second. |
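The `.sum` and `.count` pairs above are Prometheus histogram components: a mean over an interval is the change in the sum divided by the change in the count. The sketch below illustrates this for end-to-end latency with made-up numbers:

```python
# Computing a mean from Prometheus histogram components: delta(sum) / delta(count).
# Applies to pairs such as nvidia_nim.e2e_request_latency.seconds.sum/.count.
# The numbers below are illustrative, not real measurements.

def average_latency(sum_delta_seconds, count_delta):
    """Mean end-to-end request latency over an interval, in seconds."""
    if count_delta == 0:
        return 0.0  # no requests observed in the interval
    return sum_delta_seconds / count_delta

# e.g. 12.5 s of cumulative latency across 50 observed requests
print(average_latency(12.5, 50))  # 0.25 s per request
```

The same delta arithmetic applies to the time_per_output_token and time_to_first_token histograms.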
The NVIDIA NIM integration does not include any events.
nvidia_nim.openmetrics.health
Returns CRITICAL if the Agent is unable to connect to the NVIDIA NIM OpenMetrics endpoint; otherwise returns OK.
Statuses: ok, critical
Need help? Contact Datadog support.