BentoML

Supported OS: Linux, Windows, macOS

Integration 1.1.0

Overview

This check monitors BentoML through the Datadog Agent.

BentoML is an open-source platform for building, shipping, and running machine learning models in production. This integration enables you to track the health and performance of your BentoML model serving infrastructure directly from Datadog.

By using this integration, you gain visibility into key BentoML metrics such as request throughput, response latency, error rates, and resource utilization. Monitoring these metrics helps you ensure reliable model deployments, quickly detect issues, and optimize the performance of your ML services in production environments.

Minimum Agent version: 7.70.1

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
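
In Kubernetes, for example, the check can be scheduled through Autodiscovery annotations on the pod that runs your BentoML service. The snippet below is only a sketch: the container name, port, and tag value are placeholders that must match your own deployment.

    # Kubernetes pod annotation (Autodiscovery). <CONTAINER_NAME> must match the name
    # of the BentoML container, and port 3000 is assumed to be its HTTP port.
    ad.datadoghq.com/<CONTAINER_NAME>.checks: |
      {
        "bentoml": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:3000/metrics",
              "tags": ["bentoml_service:<SERVICE>"]
            }
          ]
        }
      }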

Installation

Starting with Agent version 7.71.0, the BentoML check is included in the Datadog Agent package. No additional installation is needed in your environment.

Configuration

Metrics

The BentoML integration collects data from both the health API endpoints and the Prometheus metrics endpoint. By default, BentoML exposes these endpoints, so in most cases, no additional configuration is required on the BentoML side. For more information about these endpoints and how to enable or secure them, refer to the BentoML observability documentation.
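
If metrics have been disabled on the BentoML side, re-enable them before pointing the Agent at the endpoint. The sketch below assumes BentoML's YAML configuration file (for example, one referenced by the BENTOML_CONFIG environment variable); the exact keys can differ between BentoML versions, so confirm them against the BentoML documentation for your release:

    # BentoML-side configuration sketch; the key names are assumptions, verify them
    # against your BentoML version's configuration reference.
    api_server:
      metrics:
        enabled: true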

To configure the Datadog Agent to collect BentoML metrics:

  1. Edit the bentoml.d/conf.yaml file, located in the conf.d/ folder at the root of your Agent’s configuration directory. This file controls how the Agent collects metrics from your BentoML deployment. For a full list of configuration options, see the sample bentoml.d/conf.yaml. Below is a minimal example configuration (a sketch for monitoring several services follows these steps):

    init_config:

    instances:
      - openmetrics_endpoint: http://localhost:3000/metrics
        tags:
          - bentoml_service:foo # Tag to easily scope metrics

  2. Restart the Agent.
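
If one Agent monitors several BentoML services, declare one instance per service. The sketch below is illustrative only; the ports, service names, and tag values are assumptions:

    init_config:

    instances:
      # First BentoML service (port and tag value are examples)
      - openmetrics_endpoint: http://localhost:3000/metrics
        tags:
          - bentoml_service:summarization
      # Second BentoML service exposed on another port
      - openmetrics_endpoint: http://localhost:3001/metrics
        tags:
          - bentoml_service:classification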

Logs

BentoML logs can be collected by the Datadog Agent using several methods:

  • Agent log collection (recommended): Configure the Datadog Agent to tail BentoML log files. See the BentoML documentation for more details.

For host-based Agents:

  1. Enable log collection in your datadog.yaml file (disabled by default):

    logs_enabled: true
    
  2. Configure the Agent to tail BentoML logs by editing bentoml.d/conf.yaml (or the corresponding file in conf.d/):

    logs:
      - type: file
        path: monitoring/text_summarization/data/*.log
        source: bentoml
        service: <SERVICE>
    

    Replace <SERVICE> with a name that matches your service, and set path to the location where your BentoML deployment writes its log files (the path above is only an example).

For containerized environments:

  • Ensure the BentoML log files are mounted inside the Datadog Agent container so they can be accessed and tailed. See container based log collection for more information.
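
If the BentoML container writes its logs to stdout and stderr rather than to files, log collection can instead be enabled with an Autodiscovery log annotation. A minimal sketch, where <CONTAINER_NAME> and <SERVICE> are placeholders for your deployment:

    # Kubernetes pod annotation enabling log collection for the BentoML container.
    ad.datadoghq.com/<CONTAINER_NAME>.logs: |
      [
        {
          "source": "bentoml",
          "service": "<SERVICE>"
        }
      ]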

Other log shipping options:

Choose the log collection method that best fits your environment and operational needs. Ensure the logs are tagged correctly with source:bentoml.

Validation

Run the Agent’s status subcommand and look for bentoml under the Checks section.

Data Collected

Metrics

bentoml.endpoint_livez
(gauge)
A liveness probe endpoint that checks whether the Service is still alive or needs to be restarted. This metric reports 1 if the endpoint is healthy, 0 otherwise.
bentoml.endpoint_readyz
(gauge)
A readiness probe endpoint that indicates whether the Service is ready to accept traffic. This metric reports 1 if the endpoint is healthy, 0 otherwise.
bentoml.service.adaptive_batch_size.bucket
(count)
The number of observations since the last data collection that fall within a specific upper_bound tag, from the histogram of adaptive batch sizes used during Service execution.
bentoml.service.adaptive_batch_size.count
(count)
The number of observations since the last data collection from the histogram of adaptive batch sizes used during Service execution.
bentoml.service.adaptive_batch_size.sum
(count)
The sum of batch sizes across all observations since the last data collection, from the histogram of adaptive batch sizes used during Service execution.
bentoml.service.request.count
(count)
The number of new requests that a Service has processed since the last submission.
Shown as request
bentoml.service.request.duration.bucket
(count)
The number of observations since the last data collection that fall within a specific upper_bound tag, from the request processing duration histogram.
bentoml.service.request.duration.count
(count)
The number of requests processed since the last data collection, from the request duration histogram.
bentoml.service.request.duration.sum
(count)
The total request processing time in seconds across all observations since the last data collection, from the request duration histogram.
Shown as second
bentoml.service.request.in_progress
(gauge)
The number of requests that are currently being processed by a Service.
Shown as request
bentoml.service.time_since_last_request
(gauge)
The amount of time in seconds since the last request was processed by a Service.
Shown as second

Events

The BentoML integration does not include any events.

Service Checks

See service_checks.json for a list of service checks provided by this integration.

Troubleshooting

Need help? Contact Datadog support.