Nvidia Triton

Supported OS: Linux, Windows, macOS

Integration version: 2.2.0

Overview

This check monitors Nvidia Triton through the Datadog Agent.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Nvidia Triton check is included in the Datadog Agent package. No additional installation is needed on your server.

OpenMetrics endpoint

By default, the Nvidia Triton server exposes all metrics through its Prometheus endpoint. To enable metrics reporting explicitly:

tritonserver --allow-metrics=true

To change the metric endpoint, use the --metrics-address option.

Example:

tritonserver --metrics-address=http://0.0.0.0:8002

In this case, the OpenMetrics endpoint is exposed at this URL: http://<NVIDIA_TRITON_ADDRESS>:8002/metrics.
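
Before pointing the Agent at the endpoint, it can help to confirm that it is serving metrics. The following is a minimal sketch that assumes the standard Prometheus text exposition format. The helper names `parse_metric_names` and `fetch_metric_names` are hypothetical, and the sample payload is illustrative only; the `nv_`-prefixed names are examples of what a Triton server typically emits, not an exhaustive list.

```python
import urllib.request


def parse_metric_names(exposition_text):
    """Return the sorted metric names found in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like: name{label="value"} 12.0  or  name 12.0
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)


def fetch_metric_names(url, timeout=5):
    """Fetch the endpoint at `url` and list the metric names it exposes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metric_names(response.read().decode("utf-8"))


# Illustrative payload only; real output depends on the models loaded.
SAMPLE = """\
# HELP nv_inference_request_success Number of successful inference requests
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="demo",version="1"} 3
nv_gpu_utilization{gpu_uuid="GPU-0"} 0.25
"""

print(parse_metric_names(SAMPLE))
# → ['nv_gpu_utilization', 'nv_inference_request_success']
```

Against a live server, calling `fetch_metric_names` with the URL shown above would return the names actually exposed.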

The latency summary metrics are disabled by default. To enable summary metrics for latencies, use the command below:

tritonserver --metrics-config summary_latencies=true

The response cache metrics are not reported by default. You need to enable a cache implementation on the server side by specifying a <cache_implementation> and corresponding configuration.

For instance:

tritonserver --cache-config local,size=1048576

Nvidia Triton can also expose custom metrics through its OpenMetrics endpoint. Datadog can collect these custom metrics using the extra_metrics option.

These custom Nvidia Triton metrics are considered standard metrics in Datadog.

Configuration

Edit the nvidia_triton.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory to start collecting your Nvidia Triton performance data. See the sample nvidia_triton.d/conf.yaml for all available configuration options.
Then restart the Agent.

Validation

Run the Agent’s status subcommand and look for nvidia_triton under the Checks section.

Data Collected

Metrics


nvidia_triton.cache.insertion.duration (gauge)	Total cache insertion duration, in microseconds Shown as microsecond
nvidia_triton.cache.lookup.duration (gauge)	Total cache lookup duration (hit and miss), in microseconds Shown as microsecond
nvidia_triton.cache.num.entries (gauge)	Number of responses stored in response cache
nvidia_triton.cache.num.evictions (gauge)	Number of cache evictions in response cache
nvidia_triton.cache.num.hits (gauge)	Number of cache hits in response cache
nvidia_triton.cache.num.lookups (gauge)	Number of cache lookups in response cache
nvidia_triton.cache.num.misses (gauge)	Number of cache misses in response cache
nvidia_triton.cache.util (gauge)	Cache utilization [0.0 - 1.0]
nvidia_triton.cpu.memory.total_bytes (gauge)	CPU total memory (RAM), in bytes Shown as byte
nvidia_triton.cpu.memory.used_bytes (gauge)	CPU used memory (RAM), in bytes Shown as byte
nvidia_triton.cpu.utilization (gauge)	CPU utilization rate [0.0 - 1.0]
nvidia_triton.energy.consumption.count (count)	GPU energy consumption in joules since the Triton Server started
nvidia_triton.gpu.memory.total_bytes (gauge)	GPU total memory, in bytes Shown as byte
nvidia_triton.gpu.memory.used_bytes (gauge)	GPU used memory, in bytes Shown as byte
nvidia_triton.gpu.power.limit (gauge)	GPU power management limit in watts Shown as watt
nvidia_triton.gpu.power.usage (gauge)	GPU power usage in watts Shown as watt
nvidia_triton.gpu.utilization (gauge)	GPU utilization rate [0.0 - 1.0]
nvidia_triton.inference.compute.infer.duration_us.count (count)	Cumulative compute inference duration in microseconds (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.infer.summary_us.count (count)	Cumulative compute inference duration in microseconds (count) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.infer.summary_us.quantile (gauge)	Cumulative compute inference duration in microseconds (quantile) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.infer.summary_us.sum (count)	Cumulative compute inference duration in microseconds (sum) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.input.duration_us.count (count)	Cumulative compute input duration in microseconds (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.input.summary_us.count (count)	Cumulative compute input duration in microseconds (count) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.input.summary_us.quantile (gauge)	Cumulative compute input duration in microseconds (quantile) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.input.summary_us.sum (count)	Cumulative compute input duration in microseconds (sum) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.output.duration_us.count (count)	Cumulative inference compute output duration in microseconds (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.output.summary_us.count (count)	Cumulative inference compute output duration in microseconds (count) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.output.summary_us.quantile (gauge)	Cumulative inference compute output duration in microseconds (quantile) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.compute.output.summary_us.sum (count)	Cumulative inference compute output duration in microseconds (sum) (does not include cached requests) Shown as microsecond
nvidia_triton.inference.count.count (count)	Number of inferences performed (does not include cached requests)
nvidia_triton.inference.exec.count.count (count)	Number of model executions performed (does not include cached requests)
nvidia_triton.inference.pending.request.count (gauge)	Instantaneous number of pending requests awaiting execution per-model.
nvidia_triton.inference.queue.duration_us.count (count)	Cumulative inference queuing duration in microseconds (includes cached requests) Shown as microsecond
nvidia_triton.inference.queue.summary_us.count (count)	Summary of inference queuing duration in microseconds (count) (includes cached requests) Shown as microsecond
nvidia_triton.inference.queue.summary_us.quantile (gauge)	Summary of inference queuing duration in microseconds (quantile) (includes cached requests) Shown as microsecond
nvidia_triton.inference.queue.summary_us.sum (count)	Summary of inference queuing duration in microseconds (sum) (includes cached requests) Shown as microsecond
nvidia_triton.inference.request.duration_us.count (count)	Cumulative inference request duration in microseconds (includes cached requests) Shown as microsecond
nvidia_triton.inference.request.summary_us.count (count)	Summary of inference request duration in microseconds (count) (includes cached requests) Shown as microsecond
nvidia_triton.inference.request.summary_us.quantile (gauge)	Summary of inference request duration in microseconds (quantile) (includes cached requests) Shown as microsecond
nvidia_triton.inference.request.summary_us.sum (count)	Summary of inference request duration in microseconds (sum) (includes cached requests) Shown as microsecond
nvidia_triton.inference.request_failure.count (count)	Number of failed inference requests, all batch sizes
nvidia_triton.inference.request_success.count (count)	Number of successful inference requests, all batch sizes
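
Several of the metrics above are cumulative totals, so a per-request figure has to be derived from the change between two readings. The sketch below shows one way to compute an average request latency from the request duration and successful request totals. It assumes that both totals cover the same set of requests over the interval; the function name and the tuple layout are hypothetical.

```python
def average_request_latency_us(prev, curr):
    """Average per-request latency between two cumulative snapshots.

    Each snapshot is a (total_duration_us, total_successful_requests) tuple.
    Returns None when no new requests completed in the interval.
    """
    duration_delta = curr[0] - prev[0]
    request_delta = curr[1] - prev[1]
    if request_delta <= 0:
        return None
    return duration_delta / request_delta


# Two hypothetical readings taken some time apart.
earlier = (1_000_000, 100)
later = (1_600_000, 300)
print(average_request_latency_us(earlier, later))  # → 3000.0
```

Dividing deltas rather than raw totals keeps the result representative of the recent interval instead of the whole server lifetime.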

Events

The Nvidia Triton integration does not include any events.

Service Checks

nvidia_triton.openmetrics.health

Returns CRITICAL if the Agent is unable to connect to the Nvidia Triton OpenMetrics endpoint, otherwise returns OK.

Statuses: ok, critical

nvidia_triton.health.status

Returns CRITICAL if the server responds with a 4xx or 5xx status code, OK if the response is 200, and UNKNOWN for anything else.

Statuses: ok, warning, critical
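
The rule described for nvidia_triton.health.status can be sketched as a small function. This illustrates the rule as stated, not the integration's actual source; the function name and the string return values are assumptions.

```python
def health_status(http_status_code):
    """Map an HTTP response code to a service check status."""
    if http_status_code == 200:
        return "ok"
    if 400 <= http_status_code <= 599:
        return "critical"
    # Anything else (for example 1xx, 3xx, or other 2xx codes) is unclassified.
    return "unknown"


for code in (200, 404, 503, 302):
    print(code, health_status(code))
```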

Logs

The Nvidia Triton integration can collect logs from the Nvidia Triton server and forward them to Datadog.

Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:
```
logs_enabled: true
```
Uncomment and edit the logs configuration block in your nvidia_triton.d/conf.yaml file. Here’s an example:
```
logs:
  - type: docker
    source: nvidia_triton
    service: nvidia_triton
```

In Kubernetes environments, log collection is also disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

Then, set Log Integrations as pod annotations. This can also be configured with a file, a configmap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.

Annotations v1/v2

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-triton
  annotations:
    # The identifier in the annotation key must match the container name below.
    ad.datadoghq.com/nvidia-triton.logs: '[{"source":"nvidia_triton","service":"nvidia_triton"}]'
spec:
  containers:
    - name: nvidia-triton

Troubleshooting

Need help? Contact Datadog support.