---
title: TorchServe
description: Monitor the health and performance of TorchServe
breadcrumbs: Docs > Integrations > TorchServe
---

# TorchServe
Integration version: 4.3.0
## Overview{% #overview %}

This check monitors [TorchServe](https://pytorch.org/serve/) through the Datadog Agent.

**Minimum Agent version:** 7.47.0

## Setup{% #setup %}

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates](https://docs.datadoghq.com/agent/kubernetes/integrations/) for guidance on applying these instructions.

### Installation{% #installation %}

Starting from Agent release 7.47.0, the TorchServe check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package. No additional installation is needed on your server.

{% alert level="warning" %}
This check uses [OpenMetrics](https://docs.datadoghq.com/integrations/openmetrics/), which requires Python 3, to collect metrics from the OpenMetrics endpoint that TorchServe can expose.
{% /alert %}

### Prerequisites{% #prerequisites %}

The TorchServe check collects TorchServe's metrics and performance data using three different endpoints:

- The [Inference API](https://pytorch.org/serve/inference_api.html) to collect the overall health status of your TorchServe instance.
- The [Management API](https://pytorch.org/serve/management_api.html) to collect metrics on the various models you are running.
- The [OpenMetrics endpoint](https://pytorch.org/serve/metrics_api.html) exposed by TorchServe.

You can configure these endpoints using the `config.properties` file, as described in [the TorchServe documentation](https://pytorch.org/serve/configuration.html#configure-torchserve-listening-address-and-port). For example:

```properties
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
number_of_netty_threads=32
default_workers_per_model=10
job_queue_size=1000
model_store=/home/model-server/model-store
workflow_store=/home/model-server/wf-store
load_models=all
```

This configuration file exposes the three endpoints that the integration can use to monitor your instance.

#### OpenMetrics endpoint{% #openmetrics-endpoint %}

To enable the Prometheus endpoint, you need to configure two options:

- `metrics_address`: The Metrics API binding address. Defaults to `http://127.0.0.1:8082`.
- `metrics_mode`: TorchServe supports two metric modes: `log` and `prometheus`. Defaults to `log`. You must set it to `prometheus` to collect metrics from this endpoint.

For instance:

```properties
metrics_address=http://0.0.0.0:8082
metrics_mode=prometheus
```

In this case, the OpenMetrics endpoint is exposed at this URL: `http://<TORCHSERVE_ADDRESS>:8082/metrics`.
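If you want to sanity-check what this endpoint serves before configuring the integration, the payload uses the standard Prometheus text format. The sketch below parses a small illustrative sample; the metric name, labels, and value are examples, not output captured from a real instance:

```python
# Parse a small, illustrative sample of the Prometheus text format that
# TorchServe exposes at /metrics. The payload below is an example, not
# output captured from a real instance.
sample = """\
# HELP ts_inference_requests_total Total number of inference requests.
# TYPE ts_inference_requests_total counter
ts_inference_requests_total{model_name="my_model",model_version="default"} 42.0
"""

def parse_metrics(text):
    """Map each metric line (name plus labels) to its float value."""
    metrics = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        name_and_labels, value = line.rsplit(" ", 1)
        metrics[name_and_labels] = float(value)
    return metrics

print(parse_metrics(sample))
```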

### Configuration{% #configuration %}

These three different endpoints can be monitored independently and must be configured separately in the configuration file, one API per instance. See the [sample torchserve.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/torchserve/datadog_checks/torchserve/data/conf.yaml.example) for all available configuration options.

{% tab title="OpenMetrics endpoint" %}
#### Configure the OpenMetrics endpoint{% #configure-the-openmetrics-endpoint %}

Configuration options for the OpenMetrics endpoint can be found in the configuration file under the `TorchServe OpenMetrics endpoint configuration` section. The minimal configuration only requires the `openmetrics_endpoint` option:

```yaml
init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
```

For more options, see the [sample `torchserve.d/conf.yaml` file](https://github.com/DataDog/integrations-core/blob/master/torchserve/datadog_checks/torchserve/data/conf.yaml.example).

TorchServe allows the custom service code to emit [metrics that will be available based on the configured `metrics_mode`](https://pytorch.org/serve/metrics.html#custom-metrics-api). You can configure this integration to collect these metrics using the `extra_metrics` option. These metrics will have the `torchserve.openmetrics` prefix, just like any other metrics coming from this endpoint.

{% alert level="info" %}
These custom TorchServe metrics are considered standard metrics in Datadog.
{% /alert %}
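For example, a minimal instance collecting a custom metric could look like the following, where `my_custom_torchserve_metric` is a placeholder for a metric emitted by your own service code:

```yaml
init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
    # Placeholder name: replace with a metric emitted by your service code
    extra_metrics:
      - my_custom_torchserve_metric
```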

{% /tab %}

{% tab title="Inference API" %}
#### Configure the Inference API{% #configure-the-inference-api %}

This integration relies on the [Inference API](https://pytorch.org/serve/inference_api.html) to get the overall status of your TorchServe instance. Configuration options for the Inference API can be found in the [configuration file](https://github.com/DataDog/integrations-core/blob/master/torchserve/datadog_checks/torchserve/data/conf.yaml.example) under the `TorchServe Inference API endpoint configuration` section. The minimal configuration only requires the `inference_api_url` option:

```yaml
init_config:
  ...
instances:
  - inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
```

This integration leverages the [Ping endpoint](https://pytorch.org/serve/inference_api.html#health-check-api) to collect the overall health status of your TorchServe server.
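As a sketch of what that health check amounts to: the ping endpoint returns a small JSON body with a `status` field, per the TorchServe Inference API documentation. The snippet below is an illustration of interpreting such a body, not the Agent's actual implementation:

```python
import json

def is_healthy(ping_body: str) -> bool:
    """Interpret a response body from TorchServe's /ping endpoint."""
    try:
        return json.loads(ping_body).get("status") == "Healthy"
    except json.JSONDecodeError:
        return False

print(is_healthy('{"status": "Healthy"}'))
print(is_healthy('{"status": "Unhealthy"}'))
```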
{% /tab %}

{% tab title="Management API" %}
#### Configure the Management API{% #configure-the-management-api %}

You can collect metrics related to the models that are currently running in your TorchServe server using the [Management API](https://pytorch.org/serve/management_api.html). Configuration options for the Management API can be found in the [configuration file](https://github.com/DataDog/integrations-core/blob/master/torchserve/datadog_checks/torchserve/data/conf.yaml.example) under the `TorchServe Management API endpoint configuration` section. The minimal configuration only requires the `management_api_url` option:

```yaml
init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
```

By default, the integration collects data from every model, up to 100 models. You can modify this behavior using the `limit`, `include`, and `exclude` options. For example:

```yaml
init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    limit: 25
    include: 
      - my_model.* 
```

This configuration only collects metrics for model names that match the `my_model.*` regular expression, up to 25 models.

You can also exclude some models:

```yaml
init_config:
  ...
instances:
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    exclude: 
      - test.* 
```

This configuration collects metrics for every model name that does not match the `test.*` regular expression, up to 100 models.

{% alert level="info" %}
You can use the `include` and `exclude` options in the same configuration. The `exclude` filters are applied after the `include` ones.
{% /alert %}
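The combined behavior of `include`, `exclude`, and `limit` can be sketched as follows. This is illustrative only: the model names and patterns are hypothetical, and this is not the Agent's actual implementation.

```python
import re

def filter_models(models, include=None, exclude=None, limit=100):
    """Apply include patterns first, then exclude patterns, then the limit."""
    if include:
        models = [m for m in models if any(re.match(p, m) for p in include)]
    if exclude:
        models = [m for m in models if not any(re.match(p, m) for p in exclude)]
    return models[:limit]

# Hypothetical model names
models = ["my_model_a", "my_model_b-test", "other_model"]
print(filter_models(models, include=["my_model.*"], exclude=[".*-test"]))
# ['my_model_a']
```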

By default, the integration retrieves the full list of models every time the check runs. You can cache this list using the `interval` option to improve the performance of this check.

{% alert level="warning" %}
Using the `interval` option can also delay some metrics and events.
{% /alert %}

{% /tab %}

#### Complete configuration{% #complete-configuration %}

{% tab title="Host" %}
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections:

```yaml
init_config:
  ...
instances:
  - openmetrics_endpoint: http://<TORCHSERVE_ADDRESS>:8082/metrics
    # Also collect your own TorchServe metrics
    extra_metrics:
      - my_custom_torchserve_metric
  - inference_api_url: http://<TORCHSERVE_ADDRESS>:8080
  - management_api_url: http://<TORCHSERVE_ADDRESS>:8081
    # Include all the model names that match this regex   
    include:
      - my_models.*
    # But exclude all the ones that finish with `-test`
    exclude: 
      - .*-test 
    # Refresh the list of models only every hour
    interval: 3600
```

[Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands/#start-stop-and-restart-the-agent) after modifying the configuration.
{% /tab %}

{% tab title="Docker" %}
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections as a Docker label inside `docker-compose.yml`:

```yaml
labels:
  com.datadoghq.ad.checks: '{"torchserve":{"instances":[{"openmetrics_endpoint":"http://%%host%%:8082/metrics","extra_metrics":["my_custom_torchserve_metric"]},{"inference_api_url":"http://%%host%%:8080"},{"management_api_url":"http://%%host%%:8081","include":["my_models.*"],"exclude":[".*-test"],"interval":3600}]}}'
```

{% /tab %}

{% tab title="Kubernetes" %}
This example demonstrates the complete configuration leveraging the three different APIs described in the previous sections as Kubernetes annotations on your TorchServe pods:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/torchserve.checks: |-
      {
        "torchserve": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8082/metrics",
              "extra_metrics": [
                "my_custom_torchserve_metric"
              ]
            },
            {
              "inference_api_url": "http://%%host%%:8080"
            },
            {
              "management_api_url": "http://%%host%%:8081",
              "include": [
                ".*"
              ],
              "exclude": [
                ".*-test"
              ],
              "interval": 3600
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'torchserve'
# (...)
```

{% /tab %}

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands/#agent-status-and-information) and look for `torchserve` under the Checks section.

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **torchserve.management\_api.model.batch\_size**(gauge)             | Maximum batch size that a model is expected to handle.                                                              |
| **torchserve.management\_api.model.is\_loaded\_at\_startup**(gauge) | Whether or not the model was loaded when TorchServe started. 1 if true, 0 otherwise.                                |
| **torchserve.management\_api.model.max\_batch\_delay**(gauge)       | The maximum batch delay time in ms TorchServe waits to receive batch_size number of requests.*Shown as millisecond* |
| **torchserve.management\_api.model.version.is\_default**(gauge)     | Whether or not this version of the model is the default one. 1 if true, 0 otherwise.                                |
| **torchserve.management\_api.model.versions**(gauge)                | Total number of versions for a given model.                                                                         |
| **torchserve.management\_api.model.worker.is\_gpu**(gauge)          | Whether or not this worker is using a GPU. 1 if true, 0 otherwise.                                                  |
| **torchserve.management\_api.model.worker.memory\_usage**(gauge)    | Memory used by the worker in byte.*Shown as byte*                                                                   |
| **torchserve.management\_api.model.worker.status**(gauge)           | The status of a given worker. 1 if ready, 2 if loading, 3 if unloading, 0 otherwise.                                |
| **torchserve.management\_api.model.workers.current**(gauge)         | Current number of workers of a given model.                                                                         |
| **torchserve.management\_api.model.workers.max**(gauge)             | Maximum number of workers defined of a given model.                                                                 |
| **torchserve.management\_api.model.workers.min**(gauge)             | Minimum number of workers defined of a given model.                                                                 |
| **torchserve.management\_api.models**(gauge)                        | Total number of models.                                                                                             |
| **torchserve.openmetrics.cpu.utilization**(gauge)                   | CPU utilization on host.*Shown as percent*                                                                          |
| **torchserve.openmetrics.disk.available**(gauge)                    | Disk available on host.*Shown as gigabyte*                                                                          |
| **torchserve.openmetrics.disk.used**(gauge)                         | Disk used on host.*Shown as gigabyte*                                                                               |
| **torchserve.openmetrics.disk.utilization**(gauge)                  | Disk utilization on host.*Shown as percent*                                                                         |
| **torchserve.openmetrics.gpu.memory.used**(gauge)                   | GPU memory used on host.*Shown as megabyte*                                                                         |
| **torchserve.openmetrics.gpu.memory.utilization**(gauge)            | GPU memory utilization on host.*Shown as percent*                                                                   |
| **torchserve.openmetrics.gpu.utilization**(gauge)                   | GPU utilization on host.*Shown as percent*                                                                          |
| **torchserve.openmetrics.handler\_time**(gauge)                     | Time spent in backend handler.*Shown as millisecond*                                                                |
| **torchserve.openmetrics.inference.count**(count)                   | Total number of inference requests received.*Shown as request*                                                      |
| **torchserve.openmetrics.inference.latency.count**(count)           | Total inference latency in Microseconds.*Shown as microsecond*                                                      |
| **torchserve.openmetrics.memory.available**(gauge)                  | Memory available on host.*Shown as megabyte*                                                                        |
| **torchserve.openmetrics.memory.used**(gauge)                       | Memory used on host.*Shown as megabyte*                                                                             |
| **torchserve.openmetrics.memory.utilization**(gauge)                | Memory utilization on host.*Shown as percent*                                                                       |
| **torchserve.openmetrics.prediction\_time**(gauge)                  | Backend prediction time.*Shown as millisecond*                                                                      |
| **torchserve.openmetrics.queue.latency.count**(count)               | Total queue latency in Microseconds.*Shown as microsecond*                                                          |
| **torchserve.openmetrics.queue.time**(gauge)                        | Time spent by a job in request queue in Milliseconds.*Shown as millisecond*                                         |
| **torchserve.openmetrics.requests.2xx.count**(count)                | Total number of requests with response in 200-300 status code range.*Shown as request*                              |
| **torchserve.openmetrics.requests.4xx.count**(count)                | Total number of requests with response in 400-500 status code range.*Shown as request*                              |
| **torchserve.openmetrics.requests.5xx.count**(count)                | Total number of requests with response status code above 500.*Shown as request*                                     |
| **torchserve.openmetrics.worker.load\_time**(gauge)                 | Time taken by worker to load model in Milliseconds.*Shown as millisecond*                                           |
| **torchserve.openmetrics.worker.thread\_time**(gauge)               | Time spent in worker thread excluding backend response time in Milliseconds.*Shown as millisecond*                  |

Metrics are prefixed using the API they are coming from:

- `torchserve.openmetrics.*` for metrics coming from the OpenMetrics endpoint.
- `torchserve.inference_api.*` for metrics coming from the Inference API.
- `torchserve.management_api.*` for metrics coming from the Management API.

### Events{% #events %}

The TorchServe integration includes three events collected through the Management API:

- `torchserve.management_api.model_added`: This event fires when a new model has been added.
- `torchserve.management_api.model_removed`: This event fires when a model has been removed.
- `torchserve.management_api.default_version_changed`: This event fires when a default version has been set for a given model.

{% alert level="info" %}
You can disable these events by setting the `submit_events` option to `false` in your [configuration file](https://github.com/DataDog/integrations-core/blob/master/torchserve/datadog_checks/torchserve/data/conf.yaml.example).
{% /alert %}

### Service Checks{% #service-checks %}

**torchserve.openmetrics.health**

Returns `CRITICAL` if the Agent is unable to connect to the OpenMetrics endpoint, otherwise returns `OK`.

*Statuses: ok, critical*

**torchserve.inference\_api.health**

Returns `CRITICAL` if the Agent is unable to connect to the Inference API endpoint or if it is unhealthy, otherwise returns `OK`.

*Statuses: ok, critical*

**torchserve.management\_api.health**

Returns `CRITICAL` if the Agent is unable to connect to the Management API endpoint, otherwise returns `OK`.

*Statuses: ok, critical*

### Logs{% #logs %}

The TorchServe integration can collect logs from the TorchServe service and forward them to Datadog.

1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your `datadog.yaml` file:

   ```yaml
   logs_enabled: true
   ```

1. Uncomment and edit the logs configuration block in your `torchserve.d/conf.yaml` file. Here's an example:

   ```yaml
   logs:
     - type: file
       path: /var/log/torchserve/model_log.log
       source: torchserve
       service: torchserve
     - type: file
       path: /var/log/torchserve/ts_log.log
       source: torchserve
       service: torchserve
   ```

See [the example configuration file](https://github.com/DataDog/integrations-core/blob/master/torchserve/datadog_checks/torchserve/data/conf.yaml.example) for how to collect all logs.

For more information about the logging configuration with TorchServe, see the [official TorchServe documentation](https://pytorch.org/serve/logging.html?highlight=logs).

{% alert level="warning" %}
You can also collect logs from the `access_log.log` file. However, these logs are already included in the `ts_log.log` file, so configuring both files results in duplicated logs in Datadog.
{% /alert %}

## Troubleshooting{% #troubleshooting %}

Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).
