---
title: Nvidia Triton
description: NVIDIA Triton Inference Server is open source inference-serving software
breadcrumbs: Docs > Integrations > Nvidia Triton
---

# Nvidia Triton
**Integration version:** 3.4.1
## Overview{% #overview %}

This check monitors [Nvidia Triton](https://www.nvidia.com/en-us/ai-data-science/products/triton-inference-server/) through the Datadog Agent.

**Minimum Agent version:** 7.50.0

## Setup{% #setup %}

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates](https://docs.datadoghq.com/agent/kubernetes/integrations.md) for guidance on applying these instructions.

### Installation{% #installation %}

The Nvidia Triton check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package. No additional installation is needed on your server.

#### OpenMetrics endpoint{% #openmetrics-endpoint %}

By default, the Nvidia Triton server exposes all metrics through the Prometheus endpoint. To enable all metrics reporting:

```shell
tritonserver --allow-metrics=true
```

To change the metric endpoint, use the `--metrics-address` option.

Example:

```shell
tritonserver --metrics-address=http://0.0.0.0:8002
```

In this case, the OpenMetrics endpoint is exposed at this URL: `http://<NVIDIA_TRITON_ADDRESS>:8002/metrics`.
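
To confirm that the endpoint is reachable before configuring the Agent, you can query it directly. This is a minimal sketch that assumes the default metrics port of 8002 and a Triton server running on the local host:

```shell
# Fetch the OpenMetrics payload and print the first few metric lines
curl -s http://localhost:8002/metrics | head -n 20
```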

The [latency summary](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html#summaries) metrics are disabled by default. To enable summary metrics for latencies, use the command below:

```shell
tritonserver --metrics-config summary_latencies=true
```
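
Triton also lets you choose which quantiles the summary metrics report. The `summary_quantiles` setting below follows the `quantile:error` pair format described in NVIDIA's summary metrics documentation; verify the exact syntax against your Triton version:

```shell
# Enable latency summaries and configure the reported quantiles
tritonserver --metrics-config summary_latencies=true \
             --metrics-config summary_quantiles="0.5:0.05,0.9:0.01,0.95:0.001,0.99:0.001"
```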

The [response cache metrics](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html#response-cache-metrics) are not reported by default. You need to enable a cache implementation on the server side by specifying a `<cache_implementation>` and its corresponding configuration.

For instance:

```shell
tritonserver --cache-config local,size=1048576
```

Nvidia Triton can also expose [custom metrics](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/metrics.html#custom-metrics) through its OpenMetrics endpoint. Datadog can collect these custom metrics using the `extra_metrics` option.

{% alert level="warning" %}
These custom Nvidia Triton metrics are considered standard metrics in Datadog.
{% /alert %}
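
As a sketch, custom metrics can be added to the check configuration with `extra_metrics`. The endpoint and the metric names below (`custom_accuracy` and its Datadog-side name) are placeholders for whatever your Triton deployment actually exposes:

```yaml
instances:
  - openmetrics_endpoint: http://localhost:8002/metrics
    extra_metrics:
      # Map the raw OpenMetrics name to the name reported in Datadog
      - custom_accuracy: custom.accuracy
```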

### Configuration{% #configuration %}

1. Edit the `nvidia_triton.d/conf.yaml` file, located in the `conf.d/` folder at the root of your Agent's configuration directory, to start collecting your Nvidia Triton performance data. See the [sample nvidia_triton.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/nvidia_triton/datadog_checks/nvidia_triton/data/conf.yaml.example) for all available configuration options, and the minimal sketch after these steps.

1. [Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).
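
A minimal instance configuration might look like the following sketch; the URL assumes Triton's default metrics port of 8002 on the local host:

```yaml
init_config:

instances:
  - openmetrics_endpoint: http://localhost:8002/metrics
```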

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands.md#agent-status-and-information) and look for `nvidia_triton` under the Checks section.
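
For example, on a Linux host (the exact invocation depends on your platform and installation method):

```shell
# Show the Agent status and filter for the nvidia_triton check
sudo datadog-agent status | grep -A 10 nvidia_triton
```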

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **nvidia\_triton.cache.insertion.duration**(gauge)                      | Total cache insertion duration, in microseconds*Shown as microsecond*                                                            |
| **nvidia\_triton.cache.lookup.duration**(gauge)                         | Total cache lookup duration (hit and miss), in microseconds*Shown as microsecond*                                                |
| **nvidia\_triton.cache.num.entries**(gauge)                             | Number of responses stored in response cache                                                                                     |
| **nvidia\_triton.cache.num.evictions**(gauge)                           | Number of cache evictions in response cache                                                                                      |
| **nvidia\_triton.cache.num.hits**(gauge)                                | Number of cache hits in response cache                                                                                           |
| **nvidia\_triton.cache.num.lookups**(gauge)                             | Number of cache lookups in response cache                                                                                        |
| **nvidia\_triton.cache.num.misses**(gauge)                              | Number of cache misses in response cache                                                                                         |
| **nvidia\_triton.cache.util**(gauge)                                    | Cache utilization [0.0 - 1.0]                                                                                                    |
| **nvidia\_triton.cpu.memory.total\_bytes**(gauge)                       | CPU total memory (RAM), in bytes*Shown as byte*                                                                                  |
| **nvidia\_triton.cpu.memory.used\_bytes**(gauge)                        | CPU used memory (RAM), in bytes*Shown as byte*                                                                                   |
| **nvidia\_triton.cpu.utilization**(gauge)                               | CPU utilization rate [0.0 - 1.0]                                                                                                 |
| **nvidia\_triton.energy.consumption.count**(count)                      | GPU energy consumption in joules since the Triton Server started                                                                 |
| **nvidia\_triton.gpu.memory.total\_bytes**(gauge)                       | GPU total memory, in bytes*Shown as byte*                                                                                        |
| **nvidia\_triton.gpu.memory.used\_bytes**(gauge)                        | GPU used memory, in bytes*Shown as byte*                                                                                         |
| **nvidia\_triton.gpu.power.limit**(gauge)                               | GPU power management limit in watts*Shown as watt*                                                                               |
| **nvidia\_triton.gpu.power.usage**(gauge)                               | GPU power usage in watts*Shown as watt*                                                                                          |
| **nvidia\_triton.gpu.utilization**(gauge)                               | GPU utilization rate [0.0 - 1.0)                                                                                                 |
| **nvidia\_triton.inference.compute.infer.duration\_us.count**(count)    | Cumulative compute inference duration in microseconds (does not include cached requests)*Shown as microsecond*                   |
| **nvidia\_triton.inference.compute.infer.summary\_us.count**(count)     | Cumulative compute inference duration in microseconds (count) (does not include cached requests)*Shown as microsecond*           |
| **nvidia\_triton.inference.compute.infer.summary\_us.quantile**(gauge)  | Cumulative compute inference duration in microseconds (quantile)(does not include cached requests)*Shown as microsecond*         |
| **nvidia\_triton.inference.compute.infer.summary\_us.sum**(count)       | Cumulative compute inference duration in microseconds (sum) (does not include cached requests)*Shown as microsecond*             |
| **nvidia\_triton.inference.compute.input.duration\_us.count**(count)    | Cumulative compute input duration in microseconds (does not include cached requests)*Shown as microsecond*                       |
| **nvidia\_triton.inference.compute.input.summary\_us.count**(count)     | Cumulative compute input duration in microseconds (count) (does not include cached requests)*Shown as microsecond*               |
| **nvidia\_triton.inference.compute.input.summary\_us.quantile**(gauge)  | Cumulative compute input duration in microseconds (quantile) (does not include cached requests)*Shown as microsecond*            |
| **nvidia\_triton.inference.compute.input.summary\_us.sum**(count)       | Cumulative compute input duration in microseconds (sum) (does not include cached requests)*Shown as microsecond*                 |
| **nvidia\_triton.inference.compute.output.duration\_us.count**(count)   | Cumulative inference compute output duration in microseconds (does not include cached requests)*Shown as microsecond*            |
| **nvidia\_triton.inference.compute.output.summary\_us.count**(count)    | Cumulative inference compute output duration in microseconds (count) (does not include cached requests)*Shown as microsecond*    |
| **nvidia\_triton.inference.compute.output.summary\_us.quantile**(gauge) | Cumulative inference compute output duration in microseconds (quantile) (does not include cached requests)*Shown as microsecond* |
| **nvidia\_triton.inference.compute.output.summary\_us.sum**(count)      | Cumulative inference compute output duration in microseconds (sum) (does not include cached requests)*Shown as microsecond*      |
| **nvidia\_triton.inference.count.count**(count)                         | Number of inferences performed (does not include cached requests)                                                                |
| **nvidia\_triton.inference.exec.count.count**(count)                    | Number of model executions performed (does not include cached requests)                                                          |
| **nvidia\_triton.inference.pending.request.count**(gauge)               | Instantaneous number of pending requests awaiting execution, per model                                                           |
| **nvidia\_triton.inference.queue.duration\_us.count**(count)            | Cumulative inference queuing duration in microseconds (includes cached requests)*Shown as microsecond*                           |
| **nvidia\_triton.inference.queue.summary\_us.count**(count)             | Summary of inference queuing duration in microseconds (count) (includes cached requests)*Shown as microsecond*                   |
| **nvidia\_triton.inference.queue.summary\_us.quantile**(gauge)          | Summary of inference queuing duration in microseconds (quantile) (includes cached requests)*Shown as microsecond*                |
| **nvidia\_triton.inference.queue.summary\_us.sum**(count)               | Summary of inference queuing duration in microseconds (sum) (includes cached requests)*Shown as microsecond*                     |
| **nvidia\_triton.inference.request.duration\_us.count**(count)          | Cumulative inference request duration in microseconds (includes cached requests)*Shown as microsecond*                           |
| **nvidia\_triton.inference.request.summary\_us.count**(count)           | Summary of inference request duration in microseconds (count) (includes cached requests)*Shown as microsecond*                   |
| **nvidia\_triton.inference.request.summary\_us.quantile**(gauge)        | Summary of inference request duration in microseconds (quantile) (includes cached requests)*Shown as microsecond*                |
| **nvidia\_triton.inference.request.summary\_us.sum**(count)             | Summary of inference request duration in microseconds (sum) (includes cached requests)*Shown as microsecond*                     |
| **nvidia\_triton.inference.request\_failure.count**(count)              | Number of failed inference requests, all batch sizes                                                                             |
| **nvidia\_triton.inference.request\_success.count**(count)              | Number of successful inference requests, all batch sizes                                                                         |

### Events{% #events %}

The Nvidia Triton integration does not include any events.

### Service Checks{% #service-checks %}

The Nvidia Triton integration includes two service checks.

**nvidia\_triton.openmetrics.health**

Returns `CRITICAL` if the Agent is unable to connect to the Nvidia Triton OpenMetrics endpoint, otherwise returns `OK`.

*Statuses: ok, critical*

**nvidia\_triton.health.status**

Returns `CRITICAL` if the server returns a 4xx or 5xx response, `OK` if the response is 200, and `UNKNOWN` for everything else.

*Statuses: ok, warning, critical*

### Logs{% #logs %}

The Nvidia Triton integration can collect logs from the Nvidia Triton server and forward them to Datadog.

{% tab title="Host" %}

1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your `datadog.yaml` file:

   ```yaml
   logs_enabled: true
   ```

1. Uncomment and edit the logs configuration block in your `nvidia_triton.d/conf.yaml` file. Here's an example:

   ```yaml
   logs:
     - type: docker
       source: nvidia_triton
       service: nvidia_triton
   ```

{% /tab %}

{% tab title="Kubernetes" %}
Collecting logs is disabled by default in the Datadog Agent. To enable it, see [Kubernetes Log Collection](https://docs.datadoghq.com/agent/kubernetes/log.md#setup).

Then, set Log Integrations as pod annotations. This can also be configured with a file, a configmap, or a key-value store. For more information, see the configuration section of [Kubernetes Log Collection](https://docs.datadoghq.com/agent/kubernetes/log.md#configuration).

**Annotations v1/v2**

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-triton
  annotations:
    ad.datadoghq.com/nvidia-triton.logs: '[{"source":"nvidia_triton","service":"nvidia_triton"}]'
spec:
  containers:
    - name: nvidia-triton
```

{% /tab %}

## Troubleshooting{% #troubleshooting %}

Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).
