---
title: Databricks
description: Monitor the reliability and cost of your Databricks environment.
---


# Databricks
Integration version 1.0.0
{% callout %}
# Important note for users on the following Datadog sites: us2.ddog-gov.com

{% alert level="info" %}
To find out if this integration is available in your organization, see your [Datadog Integrations](https://app.datadoghq.com/integrations) page or ask your organization administrator.

To initiate an exception request to enable this integration for your organization, email [support@ddog-gov.com](mailto:support@ddog-gov.com).
{% /alert %}

{% /callout %}


## Overview{% #overview %}

Datadog offers several Databricks monitoring capabilities.

[Data Observability: Jobs Monitoring](https://www.datadoghq.com/product/data-jobs-monitoring/) provides monitoring for your Databricks jobs and clusters. You can detect problematic Databricks jobs and workflows anywhere in your data pipelines, remediate failed and long-running jobs faster, and optimize cluster resources to reduce costs.

[Cloud Cost Management](https://www.datadoghq.com/product/cloud-cost-management/) gives you a view to analyze all your Databricks DBU costs alongside the associated cloud spend.

[Log Management](https://www.datadoghq.com/product/log-management/) enables you to aggregate and analyze logs from your Databricks jobs and clusters. You can collect these logs as part of [Data Observability: Jobs Monitoring](https://www.datadoghq.com/product/data-jobs-monitoring/).

[Infrastructure Monitoring](https://docs.datadoghq.com/integrations/databricks.md?tab=driveronly) gives you a limited subset of the Data Observability: Jobs Monitoring functionality: visibility into the resource utilization of your Databricks clusters and Apache Spark performance metrics.

[Reference Tables](https://docs.datadoghq.com/reference_tables.md) allow you to import metadata from your Databricks workspace into Datadog. These tables enrich your Datadog telemetry with critical context like workspace names, job definitions, cluster configurations, and user roles.

[Data Observability](https://docs.datadoghq.com/data_observability.md) helps data teams detect, resolve, and prevent issues affecting data quality, performance, and cost. It monitors anomalies in volume, freshness, null rates, and distributions, and integrates with pipelines to correlate issues with job runs, data streams, and infrastructure events.

Model-serving metrics provide insights into how your Databricks model-serving infrastructure is performing. With these metrics, you can detect endpoints that have high error rates or high latency, or that are over- or underprovisioned.

## Setup{% #setup %}

### Installation{% #installation %}

First, connect a new Databricks workspace in Datadog's Databricks integration tile. Complete installation by configuring one or more capabilities of the integration: Data Jobs Monitoring, Cloud Cost Management, and Model Serving.

### Configuration{% #configuration %}

#### Connect to a new Databricks Workspace{% #connect-to-a-new-databricks-workspace %}

{% tab title="Use a Service Principal for OAuth" %}

{% alert level="warning" %}
New workspace integrations must authenticate using OAuth. Workspaces already integrated with a Personal Access Token continue to function and can switch to OAuth at any time. After a workspace starts using OAuth, it cannot revert to a Personal Access Token.
{% /alert %}

1. As a **Databricks workspace admin**, go to **Settings** by clicking on your profile in the upper-right corner from within a workspace.
1. Under the **Identity and access** tab, click **Manage** next to **Service principals**. Click **Add service principal**, then **Add new**. Enter a name, then click **Add**.
1. Click on the name of your new service principal. Under the **Secrets** tab, click **Generate secret**. Set **Lifetime (days)** to the maximum value allowed (730), then click **Generate**. Take note of your client ID and client secret.
1. Under the **Permissions** tab, click **Grant access**. Search for the new service principal and grant it the **Manage** permission. Click **Save**.
1. Go back to the **Identity and access** tab and click **Manage** next to **Groups**. Click the **admins** group and add the new service principal by clicking **Add members**.
1. In Datadog, open the Databricks integration tile.
1. On the **Configure** tab, click **Add Databricks Workspace**.
1. Enter a workspace name, your Databricks workspace URL, and the client ID and secret you generated.
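
Before entering the credentials in Datadog, it can help to confirm that the client ID and secret work. The following is a minimal sketch, assuming the workspace exposes the `/oidc/v1/token` endpoint that Databricks documents for service principal (machine-to-machine) OAuth. The helper names `build_token_request` and `fetch_access_token` are hypothetical and are not part of the Datadog integration.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_token_request(
    workspace_url: str, client_id: str, client_secret: str
) -> urllib.request.Request:
    """Build (but do not send) an OAuth client-credentials token request."""
    # HTTP Basic auth carries the service principal's client ID and secret.
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/oidc/v1/token",
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request: urllib.request.Request) -> str:
    """Send the request and return the access token (requires network access)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

If `fetch_access_token(build_token_request(workspace_url, client_id, client_secret))` returns a token, the same values should be accepted in the Datadog tile. An HTTP 401 response typically points to a wrong client ID or secret.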

{% /tab %}

{% tab title="Use a Personal Access Token (Legacy)" %}

{% alert level="warning" %}
This option is only available for workspace integrations created before July 7, 2025. New workspace integrations must authenticate using OAuth.
{% /alert %}

1. In your Databricks workspace, click on your profile in the upper-right corner and go to **Settings**. Select **Developer** in the left sidebar. Next to **Access tokens**, click **Manage**.

1. Click **Generate new token**, enter "Datadog Integration" in the **Comment** field, remove the default value in **Lifetime (days)**, and click **Generate**. Take note of your token.

   **Important:**

   - Make sure you delete the default value in **Lifetime (days)** so that the token doesn't expire and the integration doesn't break.
   - Ensure the account generating the token has [CAN VIEW access](https://docs.databricks.com/en/security/auth-authz/access-control/index.html#job-acls) for the Databricks jobs and clusters you want to monitor.

   As an alternative, follow the [official Databricks documentation](https://docs.databricks.com/en/dev-tools/auth/pat.html) to generate an access token for a [service principal](https://docs.databricks.com/en/admin/users-groups/service-principals.html#what-is-a-service-principal).

1. In Datadog, open the Databricks integration tile.

1. On the **Configure** tab, click **Add Databricks Workspace**.

1. Enter a workspace name, your Databricks workspace URL, and the Databricks token you generated.
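
To check that the token can see the jobs you want to monitor, you can list jobs through the Databricks Jobs API. The following is a minimal sketch, assuming the workspace serves `GET /api/2.1/jobs/list`. The helper names `build_jobs_list_request` and `count_visible_jobs` are hypothetical.

```python
import json
import urllib.request


def build_jobs_list_request(
    workspace_url: str, token: str, limit: int = 1
) -> urllib.request.Request:
    """Build (but do not send) a request that lists jobs visible to the token."""
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.1/jobs/list?limit={limit}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def count_visible_jobs(request: urllib.request.Request) -> int:
    """Send the request and count the returned jobs (requires network access)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        # The "jobs" key can be absent when the token cannot see any job.
        return len(json.load(response).get("jobs", []))
```

A result of `0` for a workspace that has jobs suggests the account behind the token is missing CAN VIEW access.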

{% /tab %}

#### Data Jobs Monitoring{% #data-jobs-monitoring %}

1. Connect a workspace in Datadog's Databricks integration tile.
1. In the **Select products to set up integration** section, set **Data Jobs Monitoring** to **Enabled** to start monitoring Databricks jobs and clusters.
1. See [the docs for Data Jobs Monitoring](https://docs.datadoghq.com/data_jobs/databricks.md) to complete the configuration.

**Note**: Ensure that the user or service principal being used has the necessary permissions to access your Databricks cost data.

#### Cloud Cost Management{% #cloud-cost-management %}

1. Connect a workspace in Datadog's Databricks integration tile.
1. In the **Select products to set up integration** section, set **Cloud Cost Management** to **Enabled** to view and analyze Databricks DBU costs alongside the associated cloud cost.

**Note**: Ensure that the user or service principal being used has the necessary permissions to access your Databricks cost data.

#### Model Serving{% #model-serving %}

1. Configure a workspace in Datadog's Databricks integration tile.
1. In the **Select resources to set up collection** section, set **Metrics - Model Serving** to **Enabled** to ingest model serving metrics.

#### Reference Table Configuration{% #reference-table-configuration %}

1. Configure a workspace in Datadog's Databricks integration tile.
1. In the accounts detail panel, click **Reference Tables**.
1. In the **Reference Tables** tab, click **Add New Reference Table**.
1. Provide the **Reference table name**, **Databricks table name**, and **Primary key** of your Databricks view or table.
   - For optimal results, create a dedicated view in Databricks that includes only the specific data you want to send to Datadog, so that the reference table reflects the exact scope needed for your use case.
1. Click **Save**.

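
As an illustration of the dedicated view recommended above, the following sketch builds a `CREATE OR REPLACE VIEW` statement that selects only a chosen set of columns. The helper `build_view_statement` and every catalog, schema, table, and column name in the usage note are hypothetical. Run the resulting statement in Databricks yourself.

```python
from typing import Sequence


def build_view_statement(
    view_name: str, source_table: str, columns: Sequence[str]
) -> str:
    """Return a CREATE OR REPLACE VIEW statement limited to the given columns."""
    if not columns:
        raise ValueError("Select at least one column, including the primary key.")
    # Backticks keep column names with unusual characters valid in Databricks SQL.
    column_list = ", ".join(f"`{column}`" for column in columns)
    return (
        f"CREATE OR REPLACE VIEW {view_name} "
        f"AS SELECT {column_list} FROM {source_table}"
    )
```

For example, `build_view_statement("main.datadog.job_owners", "main.ops.jobs", ["job_id", "owner"])` produces a view with two columns, where `job_id` would serve as the **Primary key**.
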
#### Data Observability{% #data-observability %}

1. Connect a workspace in Datadog's Databricks integration tile.
1. In the **Select products to set up integration** section, set **Data Observability** to **Enabled** to monitor data quality, freshness, and volume anomalies.
1. See [the docs for Data Observability](https://docs.datadoghq.com/data_observability.md) for more details on configuration and features.

#### Permissions{% #permissions %}

For Datadog to access your Databricks cost data in Data Jobs Monitoring or [Cloud Cost Management](https://docs.datadoghq.com/cloud_cost_management.md), the user or service principal used to query [system tables](https://docs.databricks.com/aws/en/admin/system-tables/) must have the following permissions:

- `CAN USE` permission on the SQL Warehouse.
- Read access to the [system tables](https://docs.databricks.com/aws/en/admin/system-tables/) within Unity Catalog. This can be granted with:

```sql
GRANT USE CATALOG ON CATALOG system TO <service_principal>;
GRANT SELECT ON CATALOG system TO <service_principal>;
GRANT USE SCHEMA ON CATALOG system TO <service_principal>;
```

The user granting these privileges must have the `MANAGE` privilege on `CATALOG system`.
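
One way to confirm both permissions at once is to run a small query against a system table through a SQL warehouse. The following is a minimal sketch, assuming the Databricks SQL Statement Execution API (`POST /api/2.0/sql/statements`) is available and that the `system.billing.usage` table is enabled in your account. The helper names and the choice of probe query are assumptions, not something the integration itself runs.

```python
import json
import urllib.request


def build_permission_probe(
    workspace_url: str, access_token: str, warehouse_id: str
) -> urllib.request.Request:
    """Build (but do not send) a statement that reads one row of billing data."""
    payload = {
        "warehouse_id": warehouse_id,  # needs CAN USE on this SQL warehouse
        "statement": "SELECT 1 FROM system.billing.usage LIMIT 1",
        "wait_timeout": "30s",  # wait synchronously for up to 30 seconds
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/sql/statements",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def probe_succeeded(response_body: dict) -> bool:
    """Return True when the statement finished in the SUCCEEDED state."""
    return response_body.get("status", {}).get("state") == "SUCCEEDED"
```

Send the request with `urllib.request.urlopen` and pass the decoded JSON body to `probe_succeeded`. A `FAILED` state with a permission error suggests that one of the grants above, or `CAN USE` on the warehouse, is missing.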

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

#### Model Serving Metrics{% #model-serving-metrics %}

| Metric | Description |
| --- | --- |
| **databricks.model\_serving.cpu\_usage\_percentage** (gauge) | Average CPU utilization across all replicas during the last minute. *Shown as percent* |
| **databricks.model\_serving.gpu\_mem\_usage\_percentage.avg** (gauge) | Average GPU memory usage across all GPUs during the minute. *Shown as percent* |
| **databricks.model\_serving.gpu\_mem\_usage\_percentage.max** (gauge) | Maximum GPU memory usage across all GPUs during the minute. *Shown as percent* |
| **databricks.model\_serving.gpu\_mem\_usage\_percentage.min** (gauge) | Minimum GPU memory usage across all GPUs during the minute. *Shown as percent* |
| **databricks.model\_serving.gpu\_usage\_percentage.avg** (gauge) | Average GPU utilization across all GPUs during the minute. *Shown as percent* |
| **databricks.model\_serving.gpu\_usage\_percentage.max** (gauge) | Maximum GPU utilization across all GPUs during the minute. *Shown as percent* |
| **databricks.model\_serving.gpu\_usage\_percentage.min** (gauge) | Minimum GPU utilization across all GPUs during the minute. *Shown as percent* |
| **databricks.model\_serving.mem\_usage\_percentage** (gauge) | Average memory utilization across all replicas during the last minute. *Shown as percent* |
| **databricks.model\_serving.provisioned\_concurrent\_requests\_total** (gauge) | Provisioned concurrency during the last minute. *Shown as request* |
| **databricks.model\_serving.request\_4xx\_count\_total** (gauge) | Number of 4xx errors during the last minute. *Shown as request* |
| **databricks.model\_serving.request\_5xx\_count\_total** (gauge) | Number of 5xx errors during the last minute. *Shown as request* |
| **databricks.model\_serving.request\_count\_total** (gauge) | Number of requests during the last minute. *Shown as request* |
| **databricks.model\_serving.request\_latency\_ms.75percentile** (gauge) | 75th percentile request latency in milliseconds during the minute. *Shown as millisecond* |
| **databricks.model\_serving.request\_latency\_ms.90percentile** (gauge) | 90th percentile request latency in milliseconds during the minute. *Shown as millisecond* |
| **databricks.model\_serving.request\_latency\_ms.95percentile** (gauge) | 95th percentile request latency in milliseconds during the minute. *Shown as millisecond* |
| **databricks.model\_serving.request\_latency\_ms.99percentile** (gauge) | 99th percentile request latency in milliseconds during the minute. *Shown as millisecond* |
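
To illustrate how these metrics can flag endpoints with high error rates, the following sketch derives a server error rate from one-minute values of `databricks.model_serving.request_5xx_count_total` and `databricks.model_serving.request_count_total`. The function names and the 5% threshold are assumptions chosen for the example, not Datadog defaults.

```python
def error_rate_percent(count_5xx: float, count_total: float) -> float:
    """Return the share of requests that ended in a 5xx error, in percent."""
    if count_total <= 0:
        # No traffic during the minute: report 0 instead of dividing by zero.
        return 0.0
    return 100.0 * count_5xx / count_total


def is_unhealthy(
    count_5xx: float, count_total: float, threshold_percent: float = 5.0
) -> bool:
    """Flag an endpoint whose 5xx error rate exceeds the (assumed) threshold."""
    return error_rate_percent(count_5xx, count_total) > threshold_percent
```

In practice, the same ratio can be expressed as a formula over the two metrics in a Datadog dashboard or monitor.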

### Service Checks{% #service-checks %}

The Databricks integration does not include any service checks.

### Events{% #events %}

The Databricks integration does not include any events.

## Troubleshooting{% #troubleshooting %}

You can troubleshoot issues yourself by enabling the [Databricks web terminal](https://docs.databricks.com/en/clusters/web-terminal.html) or by using a [Databricks Notebook](https://docs.databricks.com/en/notebooks/index.html). Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).

## Further Reading{% #further-reading %}

- [Troubleshoot and optimize data processing workloads with Data Jobs Monitoring](https://www.datadoghq.com/blog/data-jobs-monitoring/)
- [Observing the data lifecycle with Datadog](https://www.datadoghq.com/blog/data-observability-monitoring/)
- [Monitor Databricks with Datadog](https://www.datadoghq.com/blog/databricks-monitoring-datadog/)