---
title: Kubeflow
description: Integration for Kubeflow
breadcrumbs: Docs > Integrations > Kubeflow
---

# Kubeflow
Integration version 2.4.0
## Overview{% #overview %}

This check monitors [Kubeflow](https://docs.datadoghq.com/integrations/kubeflow.md) through the Datadog Agent.

**Minimum Agent version:** 7.59.0

## Setup{% #setup %}

{% alert level="warning" %}
This integration is currently released in Preview mode. Its availability is subject to change in the future.
{% /alert %}

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the [Autodiscovery Integration Templates](https://docs.datadoghq.com/agent/kubernetes/integrations.md) for guidance on applying these instructions.

### Installation{% #installation %}

The Kubeflow check is included in the [Datadog Agent](https://app.datadoghq.com/account/settings/agent/latest) package. No additional installation is needed on your server.

### Configuration{% #configuration %}

1. Edit the `kubeflow.d/conf.yaml` file in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your Kubeflow performance data. See the [sample kubeflow.d/conf.yaml](https://github.com/DataDog/integrations-core/blob/master/kubeflow/datadog_checks/kubeflow/data/conf.yaml.example) for all available configuration options.

1. [Restart the Agent](https://docs.datadoghq.com/agent/guide/agent-commands.md#start-stop-and-restart-the-agent).
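
A minimal `kubeflow.d/conf.yaml` might look like the following sketch. The host and port in the endpoint URL are assumptions; point them at wherever your Kubeflow component exposes Prometheus-formatted metrics:

```yaml
instances:
    ## @param openmetrics_endpoint - string - required
    ## Endpoint exposing the component's Prometheus-formatted metrics.
    ## localhost:9090 is an assumption; adjust to your deployment.
  - openmetrics_endpoint: http://localhost:9090/metrics
```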

#### Metric collection{% #metric-collection %}

Make sure that Prometheus-formatted metrics are exposed for your `kubeflow` component. For the Agent to start collecting metrics, the `kubeflow` pods need to be annotated.

Kubeflow has metrics endpoints that can be accessed on port `9090`.

To expose Kubeflow metrics through Prometheus, you may need to enable Prometheus service monitoring for the component in question.

You can use the Kube-Prometheus-Stack Helm chart or a custom Prometheus installation.

##### How to install Kube-Prometheus-Stack{% #how-to-install-kube-prometheus-stack %}

1. Add the Helm repository:

   ```
   helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
   helm repo update
   ```

1. Install the chart:

   ```
   helm install prometheus-stack prometheus-community/kube-prometheus-stack
   ```

1. Expose the Prometheus service externally:

   ```
   kubectl port-forward prometheus-stack 9090:9090
   ```
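
To confirm that Prometheus is reachable through the port-forward, you can query its health endpoint locally (this assumes the port-forward above is still running):

```
curl http://localhost:9090/-/healthy
```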

##### Set up ServiceMonitors for Kubeflow components{% #set-up-servicemonitors-for-kubeflow-components %}

You need to configure ServiceMonitors for Kubeflow components to expose their Prometheus metrics. If your Kubeflow component already exposes Prometheus metrics by default, you only need to configure Prometheus to scrape them.

The ServiceMonitor would look like this:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <kubeflow-component>-monitor
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: <kubeflow-component-name>
  endpoints:
  - port: http
    path: /metrics
```

Replace `<kubeflow-component>` with `pipelines`, `kserve`, or `katib`, and replace `<kubeflow-component-name>` with `ml-pipeline`, `kserve`, or `katib`.
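
After saving the manifest (for example as `kubeflow-servicemonitor.yaml`, a hypothetical filename), apply it and confirm that the ServiceMonitor was created:

```
# The filename below is an assumption; use whatever you saved the manifest as.
kubectl apply -f kubeflow-servicemonitor.yaml
kubectl get servicemonitors
```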

**Note**: The listed metrics can only be collected if they are available (depending on the version). Some metrics are generated only when certain actions are performed.

The only parameter required for configuring the `kubeflow` check is `openmetrics_endpoint`. This parameter should be set to the location where the Prometheus-formatted metrics are exposed. The default port is `9090`. In containerized environments, `%%host%%` should be used for [host autodetection](https://docs.datadoghq.com/agent/kubernetes/integrations.md).

```yaml
apiVersion: v1
kind: Pod
# (...)
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/controller.checks: |
      {
        "kubeflow": {
          "init_config": {},
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:9090/metrics"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'controller'
# (...)
```
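
Because the annotation value must be valid JSON, a malformed annotation is a common reason the check never starts. One way to sanity-check it is to read the annotation back from the pod and pipe it through a JSON parser (the pod and annotation names below match the example above; adjust them to your deployment):

```
kubectl get pod <POD_NAME> \
  -o jsonpath='{.metadata.annotations.ad\.datadoghq\.com/controller\.checks}' \
  | python -m json.tool
```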

### Validation{% #validation %}

[Run the Agent's status subcommand](https://docs.datadoghq.com/agent/guide/agent-commands.md#agent-status-and-information) and look for `kubeflow` under the Checks section.

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **kubeflow.katib.controller.reconcile.count** (count) | Number of reconcile loops executed by the Katib controller |
| **kubeflow.katib.controller.reconcile.duration.seconds.bucket** (count) | Duration of reconcile loops executed by the Katib controller (bucket) |
| **kubeflow.katib.controller.reconcile.duration.seconds.count** (count) | Duration of reconcile loops executed by the Katib controller (count) |
| **kubeflow.katib.controller.reconcile.duration.seconds.sum** (count) | Duration of reconcile loops executed by the Katib controller (sum) *Shown as second* |
| **kubeflow.katib.experiment.created.count** (count) | Total number of experiments created |
| **kubeflow.katib.experiment.duration.seconds.bucket** (count) | Duration of experiments from start to completion (bucket) |
| **kubeflow.katib.experiment.duration.seconds.count** (count) | Duration of experiments from start to completion (count) |
| **kubeflow.katib.experiment.duration.seconds.sum** (count) | Duration of experiments from start to completion (sum) *Shown as second* |
| **kubeflow.katib.experiment.failed.count** (count) | Number of experiments that have failed |
| **kubeflow.katib.experiment.running.total** (gauge) | Number of experiments currently running |
| **kubeflow.katib.experiment.succeeded.count** (count) | Number of experiments that have successfully completed |
| **kubeflow.katib.suggestion.created.count** (count) | Total number of suggestions made |
| **kubeflow.katib.suggestion.duration.seconds.bucket** (count) | Duration of suggestion processes from start to completion (bucket) |
| **kubeflow.katib.suggestion.duration.seconds.count** (count) | Duration of suggestion processes from start to completion (count) |
| **kubeflow.katib.suggestion.duration.seconds.sum** (count) | Duration of suggestion processes from start to completion (sum) *Shown as second* |
| **kubeflow.katib.suggestion.failed.count** (count) | Number of suggestions that have failed |
| **kubeflow.katib.suggestion.running.total** (gauge) | Number of suggestions currently being processed |
| **kubeflow.katib.suggestion.succeeded.count** (count) | Number of suggestions that have successfully completed |
| **kubeflow.katib.trial.created.count** (count) | Total number of trials created |
| **kubeflow.katib.trial.duration.seconds.bucket** (count) | Duration of trials from start to completion (bucket) |
| **kubeflow.katib.trial.duration.seconds.count** (count) | Duration of trials from start to completion (count) |
| **kubeflow.katib.trial.duration.seconds.sum** (count) | Duration of trials from start to completion (sum) *Shown as second* |
| **kubeflow.katib.trial.failed.count** (count) | Number of trials that have failed |
| **kubeflow.katib.trial.running.total** (gauge) | Number of trials currently running |
| **kubeflow.katib.trial.succeeded.count** (count) | Number of trials that have successfully completed |
| **kubeflow.kserve.inference.duration.seconds.bucket** (count) | Duration of inference requests (bucket) |
| **kubeflow.kserve.inference.duration.seconds.count** (count) | Duration of inference requests (count) |
| **kubeflow.kserve.inference.duration.seconds.sum** (count) | Duration of inference requests (sum) *Shown as second* |
| **kubeflow.kserve.inference.errors.count** (count) | Number of errors encountered during inference |
| **kubeflow.kserve.inference.request.bytes.bucket** (count) | Size of inference request payloads (bucket) |
| **kubeflow.kserve.inference.request.bytes.count** (count) | Size of inference request payloads (count) |
| **kubeflow.kserve.inference.request.bytes.sum** (count) | Size of inference request payloads (sum) *Shown as byte* |
| **kubeflow.kserve.inference.response.bytes.bucket** (count) | Size of inference response payloads (bucket) |
| **kubeflow.kserve.inference.response.bytes.count** (count) | Size of inference response payloads (count) |
| **kubeflow.kserve.inference.response.bytes.sum** (count) | Size of inference response payloads (sum) *Shown as byte* |
| **kubeflow.kserve.inferences.count** (count) | Total number of inferences made |
| **kubeflow.notebook.server.created.count** (count) | Total number of notebook servers created |
| **kubeflow.notebook.server.failed.count** (count) | Number of notebook servers that have failed |
| **kubeflow.notebook.server.reconcile.count** (count) | Number of reconcile loops executed by the notebook controller |
| **kubeflow.notebook.server.reconcile.duration.seconds.bucket** (count) | Duration of reconcile loops executed by the notebook controller (bucket) |
| **kubeflow.notebook.server.reconcile.duration.seconds.count** (count) | Duration of reconcile loops executed by the notebook controller (count) |
| **kubeflow.notebook.server.reconcile.duration.seconds.sum** (count) | Duration of reconcile loops executed by the notebook controller (sum) *Shown as second* |
| **kubeflow.notebook.server.running.total** (gauge) | Number of notebook servers currently running |
| **kubeflow.notebook.server.succeeded.count** (count) | Number of notebook servers that have successfully completed |
| **kubeflow.pipeline.run.duration.seconds.bucket** (count) | Duration of pipeline runs (bucket) |
| **kubeflow.pipeline.run.duration.seconds.count** (count) | Duration of pipeline runs (count) |
| **kubeflow.pipeline.run.duration.seconds.sum** (count) | Duration of pipeline runs (sum) *Shown as second* |
| **kubeflow.pipeline.run.status** (gauge) | Status of pipeline runs |

### Events{% #events %}

The Kubeflow integration does not include any events.

### Service Checks{% #service-checks %}

**kubeflow.openmetrics.health**

Returns `CRITICAL` if the Agent is unable to connect to the Kubeflow OpenMetrics endpoint, otherwise returns `OK`.

*Statuses: ok, critical*

## Troubleshooting{% #troubleshooting %}

Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).
