---
title: AWS Inferentia and AWS Trainium Monitoring
description: >-
  Monitor the performance and usage of AWS Inferentia/Trainium instances and the
  Neuron SDK.
---

# AWS Inferentia and AWS Trainium Monitoring
**Integration version:** 3.4.1
{% callout %}
# Important note for users on the following Datadog sites: us2.ddog-gov.com

{% alert level="info" %}
To find out if this integration is available in your organization, see your [Datadog Integrations](https://app.datadoghq.com/integrations) page or ask your organization administrator.

To initiate an exception request to enable this integration for your organization, email [support@ddog-gov.com](mailto:support@ddog-gov.com).
{% /alert %}

{% /callout %}

## Overview{% #overview %}

This check monitors [AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/index.html) through the Datadog Agent. It enables monitoring of the Inferentia and Trainium devices and delivers insights into your machine learning model's performance.

**Minimum Agent version:** 7.57.0

## Setup{% #setup %}

Follow the instructions below to install and configure this check for an Agent running on an EC2 instance. For containerized environments, see the [Autodiscovery Integration Templates](https://docs.datadoghq.com/agent/kubernetes/integrations.md) for guidance on applying these instructions.
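As a rough illustration for Kubernetes, an Autodiscovery template can be attached as an annotation to the pod that exposes the Neuron Monitor endpoint. The sketch below assumes a hypothetical pod and container both named `neuron-monitor`, a placeholder image, and an exporter listening on port 8000. None of these values come from this page, so adjust them to match your deployment.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: neuron-monitor                    # hypothetical pod name
  annotations:
    # The container name in the annotation key must match the container name in spec.
    ad.datadoghq.com/neuron-monitor.checks: |
      {
        "aws_neuron": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8000"
            }
          ]
        }
      }
spec:
  containers:
    - name: neuron-monitor                # hypothetical container name
      image: <YOUR_NEURON_MONITOR_IMAGE>  # placeholder, not a real image
```

The `%%host%%` template variable is resolved by the Agent to the container's IP address at runtime, which avoids hardcoding an address in the template.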

### Installation{% #installation %}

The AWS Neuron check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package, so no additional installation is needed on your server for the check itself.

You also need to install the [AWS Neuron Tools](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/index.html) package, which provides Neuron Monitor, on each instance you want to monitor.

### Configuration{% #configuration %}

#### Metrics{% #metrics %}

1. Ensure that [Neuron Monitor](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/tools/neuron-sys-tools/neuron-monitor-user-guide.html#using-neuron-monitor-prometheus-py) is being used to expose the Prometheus endpoint.

1. Edit the `aws_neuron.d/conf.yaml` file, which is located in the `conf.d/` folder at the root of your [Agent's configuration directory](https://docs.datadoghq.com/agent/configuration/agent-configuration-files.md#agent-configuration-directory), to start collecting your AWS Neuron performance data. See the [sample aws_neuron.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/aws_neuron/datadog_checks/aws_neuron/data/conf.yaml.example) for all available configuration options.

1. [Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).
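A minimal `aws_neuron.d/conf.yaml` might look like the following sketch. It assumes the Neuron Monitor Prometheus exporter is running on the same host and listening on port 8000; that address is an assumption, so replace it with the one your exporter actually uses, and check the sample configuration file for the authoritative list of options.

```yaml
init_config:

instances:
    # URL of the Prometheus endpoint exposed by Neuron Monitor.
    # localhost:8000 is an assumed address; change it to match your exporter.
  - openmetrics_endpoint: http://localhost:8000
```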

#### Logs{% #logs %}

The AWS Neuron integration can collect logs from the Neuron containers and forward them to Datadog.

{% tab title="Host" %}

1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your `datadog.yaml` file:

   ```yaml
   logs_enabled: true
   ```

1. Uncomment and edit the logs configuration block in your `aws_neuron.d/conf.yaml` file. Here's an example:

   ```yaml
   logs:
     - type: docker
       source: aws_neuron
       service: aws_neuron
   ```

{% /tab %}

{% tab title="Kubernetes" %}
Collecting logs is disabled by default in the Datadog Agent. To enable it, see [Kubernetes Log Collection](https://docs.datadoghq.com/agent/kubernetes/log.md#setup).

Then, set the log integration configuration as pod annotations. You can also configure it with a file, a ConfigMap, or a key-value store. For more information, see the configuration section of [Kubernetes Log Collection](https://docs.datadoghq.com/agent/kubernetes/log.md#configuration).
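As a sketch, a log annotation on the pod could look like the following. `<CONTAINER_NAME>` is a placeholder for the name of the container whose logs you want to collect, and the `source` and `service` values mirror the host example rather than anything required by Kubernetes.

```yaml
metadata:
  annotations:
    # Replace <CONTAINER_NAME> with the name of the Neuron container in this pod.
    ad.datadoghq.com/<CONTAINER_NAME>.logs: '[{"source": "aws_neuron", "service": "aws_neuron"}]'
```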
{% /tab %}

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands.md#agent-status-and-information) and look for `aws_neuron` under the Checks section.

## Data Collected{% #data-collected %}

### Metrics{% #metrics-1 %}

| Metric | Description |
| --- | --- |
| **aws\_neuron.execution.errors.count** (count) | Execution errors total |
| **aws\_neuron.execution.errors\_created** (gauge) | Execution errors total |
| **aws\_neuron.execution.latency\_seconds** (gauge) | Execution latency in seconds *Shown as second* |
| **aws\_neuron.execution.status.count** (count) | Execution status total |
| **aws\_neuron.execution.status\_created** (gauge) | Execution status total |
| **aws\_neuron.hardware\_ecc\_events.count** (count) | Hardware ECC events total |
| **aws\_neuron.hardware\_ecc\_events\_created** (gauge) | Hardware ECC events total |
| **aws\_neuron.instance\_info** (gauge) | EC2 instance information |
| **aws\_neuron.neuron\_hardware\_info** (gauge) | Neuron hardware information |
| **aws\_neuron.neuron\_runtime.memory\_used\_bytes** (gauge) | Runtime memory used in bytes *Shown as byte* |
| **aws\_neuron.neuron\_runtime.vcpu\_usage\_ratio** (gauge) | Runtime vCPU utilization ratio *Shown as fraction* |
| **aws\_neuron.neuroncore.memory\_usage.constants** (gauge) | NeuronCore memory utilization for constants *Shown as byte* |
| **aws\_neuron.neuroncore.memory\_usage.model.code** (gauge) | NeuronCore memory utilization for model_code *Shown as byte* |
| **aws\_neuron.neuroncore.memory\_usage.model.shared\_scratchpad** (gauge) | NeuronCore memory utilization for model_shared_scratchpad *Shown as byte* |
| **aws\_neuron.neuroncore.memory\_usage.runtime\_memory** (gauge) | NeuronCore memory utilization for runtime_memory *Shown as byte* |
| **aws\_neuron.neuroncore.memory\_usage.tensors** (gauge) | NeuronCore memory utilization for tensors *Shown as byte* |
| **aws\_neuron.neuroncore.utilization\_ratio** (gauge) | NeuronCore utilization ratio *Shown as fraction* |
| **aws\_neuron.process.cpu\_seconds.count** (count) | Total user and system CPU time spent in seconds. *Shown as second* |
| **aws\_neuron.process.max\_fds** (gauge) | Maximum number of open file descriptors. |
| **aws\_neuron.process.open\_fds** (gauge) | Number of open file descriptors. |
| **aws\_neuron.process.resident\_memory\_bytes** (gauge) | Resident memory size in bytes. *Shown as byte* |
| **aws\_neuron.process.start\_time\_seconds** (gauge) | Start time of the process since Unix epoch in seconds. *Shown as second* |
| **aws\_neuron.process.virtual\_memory\_bytes** (gauge) | Virtual memory size in bytes. *Shown as byte* |
| **aws\_neuron.python\_gc.collections.count** (count) | Number of times this generation was collected |
| **aws\_neuron.python\_gc.objects\_collected.count** (count) | Objects collected during GC |
| **aws\_neuron.python\_gc.objects\_uncollectable.count** (count) | Uncollectable objects found during GC |
| **aws\_neuron.python\_info** (gauge) | Python platform information |
| **aws\_neuron.system.memory.total\_bytes** (gauge) | Total system memory in bytes *Shown as byte* |
| **aws\_neuron.system.memory.used\_bytes** (gauge) | Used system memory in bytes *Shown as byte* |
| **aws\_neuron.system.swap.total\_bytes** (gauge) | Total system swap in bytes *Shown as byte* |
| **aws\_neuron.system.swap.used\_bytes** (gauge) | Used system swap in bytes *Shown as byte* |
| **aws\_neuron.system.vcpu.count** (gauge) | System vCPU count |
| **aws\_neuron.system.vcpu.usage\_ratio** (gauge) | System CPU utilization ratio *Shown as fraction* |

### Events{% #events %}

The AWS Neuron integration does not include any events.

### Service Checks{% #service-checks %}

**aws\_neuron.openmetrics.health**

Returns `CRITICAL` if the Agent is unable to connect to the Neuron Monitor OpenMetrics endpoint, otherwise returns `OK`.

*Statuses: ok, critical*

## Troubleshooting{% #troubleshooting %}

In containerized environments, ensure that the Agent has network access to the endpoints specified in the `aws_neuron.d/conf.yaml` file.

Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).