Databricks

Supported OS: Linux, Windows, macOS

Integration version: 1.0.0

Databricks default dashboard

Overview

Datadog offers several Databricks monitoring capabilities.

Data Jobs Monitoring provides monitoring for your Databricks jobs and clusters. You can detect problematic Databricks jobs and workflows anywhere in your data pipelines, remediate failed and long-running jobs faster, and optimize cluster resources to reduce costs.

Cloud Cost Management lets you view and analyze all your Databricks DBU costs alongside the associated cloud spend.

Log Management enables you to aggregate and analyze logs from your Databricks jobs and clusters. You can collect these logs as part of Data Jobs Monitoring.

Infrastructure Monitoring provides a limited subset of the Data Jobs Monitoring functionality: visibility into the resource utilization of your Databricks clusters and into Apache Spark performance metrics.

Reference Tables allow you to import metadata from your Databricks workspace into Datadog. These tables enrich your Datadog telemetry with critical context like workspace names, job definitions, cluster configurations, and user roles.

Model serving metrics provide insight into how your Databricks model serving infrastructure is performing. With these metrics, you can detect endpoints with high error rates or high latency, endpoints that are over- or under-provisioned, and more.

Setup

Installation

First, connect a new Databricks workspace in Datadog’s Databricks integration tile. Then complete the installation by configuring one or more capabilities of the integration: Data Jobs Monitoring, Cloud Cost Management, and Model Serving.

Configuration

Connect to a new Databricks Workspace

OAuth

New workspaces must authenticate using OAuth. Workspaces integrated with a Personal Access Token continue to function and can switch to OAuth at any time. After a workspace starts using OAuth, it cannot revert to a Personal Access Token.
  1. In your Databricks account, click on User Management in the left menu. Then, under the Service principals tab, click Add service principal.
  2. Under the Credentials & secrets tab, click Generate secret. Set Lifetime (days) to the maximum value allowed (730), then click Generate. Take note of your client ID and client secret. Also take note of your account ID, which can be found by clicking on your profile in the upper-right corner. (You must be in the account console to retrieve the account ID. The ID will not display inside a workspace.)
  3. Click Workspaces in the left menu, then select the name of your workspace.
  4. Go to the Permissions tab and click Add permissions.
  5. Search for the service principal you created and assign it the Admin permission.
  6. In Datadog, open the Databricks integration tile.
  7. On the Configure tab, click Add Databricks Workspace.
  8. Enter a workspace name, your Databricks workspace URL, account ID, and the client ID and secret you generated.

Personal Access Token

This option is only available for workspaces created before July 7, 2025. New workspaces must authenticate using OAuth.
  1. In your Databricks workspace, click your profile in the top-right corner and go to Settings. Select Developer in the left sidebar. Next to Access tokens, click Manage.

  2. Click Generate new token, enter “Datadog Integration” in the Comment field, remove the default value in Lifetime (days), and click Generate. Take note of your token.

    Important:

    • Make sure you delete the default value in Lifetime (days) so that the token doesn’t expire and the integration doesn’t break.
    • Ensure the account generating the token has CAN VIEW access for the Databricks jobs and clusters you want to monitor.

    As an alternative, follow the official Databricks documentation to generate an access token for a service principal.

  3. In Datadog, open the Databricks integration tile.

  4. On the Configure tab, click Add Databricks Workspace.

  5. Enter a workspace name, your Databricks workspace URL, and the Databricks token you generated.

Data Jobs Monitoring

  1. Connect a workspace in Datadog’s Databricks integration tile.
  2. In the Select products to set up integration section, set Data Jobs Monitoring to Enabled to start monitoring Databricks jobs and clusters.
  3. See the docs for Data Jobs Monitoring to complete the configuration.

Note: Ensure that the user or service principal being used has the necessary permissions to access your Databricks cost data.

Cloud Cost Management

  1. Connect a workspace in Datadog’s Databricks integration tile.
  2. In the Select products to set up integration section, set Cloud Cost Management to Enabled to view and analyze Databricks DBU costs alongside the associated cloud cost.

Note: Ensure that the user or service principal being used has the necessary permissions to access your Databricks cost data.

Model Serving

  1. Configure a workspace in Datadog’s Databricks integration tile.
  2. In the Select resources to set up collection section, set Metrics - Model Serving to Enabled to ingest model serving metrics.

Reference Table Configuration

  1. Configure a workspace in Datadog’s Databricks integration tile.
  2. In the account details panel, click Reference Tables.
  3. In the Reference Tables tab, click Add New Reference Table.
  4. Provide the Reference table name, Databricks table name, and Primary key of your Databricks view or table.
  • For optimal results, create a view in Databricks that includes only the specific data you want to send to Datadog, that is, a dedicated view scoped to exactly what your use case needs (see the sketch after this list).
  5. Click Save.
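
A minimal sketch of such a view, for illustration only: it assumes the system.lakeflow.jobs system table is available in your workspace, and the target catalog, schema, and view name (main.monitoring.datadog_jobs) are hypothetical placeholders to adapt to your environment.

-- Hypothetical view exposing only the job metadata to send to Datadog.
-- job_id serves as the Reference Table primary key.
CREATE OR REPLACE VIEW main.monitoring.datadog_jobs AS
SELECT
  job_id,
  name,
  creator_id,
  workspace_id
FROM system.lakeflow.jobs
WHERE delete_time IS NULL; -- keep only jobs that still exist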

Permissions

For Datadog to access your Databricks cost data in Data Jobs Monitoring or Cloud Cost Management, the user or service principal used to query system tables must have the following permissions:

  • CAN USE permission on the SQL Warehouse.
  • Read access to the system tables within Unity Catalog. This can be granted with:
GRANT USE CATALOG ON CATALOG system TO <service_principal>;
GRANT SELECT ON CATALOG system TO <service_principal>;
GRANT USE SCHEMA ON CATALOG system TO <service_principal>;

The user granting these permissions must have the MANAGE privilege on CATALOG system.
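
To confirm the grants took effect, you can run a quick read against a cost-related system table as the service principal, on the SQL Warehouse the integration will use. This is an illustrative check; it assumes the standard system.billing.usage table is enabled in your workspace.

-- Illustrative check: a successful result confirms USE CATALOG,
-- USE SCHEMA, and SELECT on the system tables.
SELECT usage_date, sku_name, usage_quantity
FROM system.billing.usage
LIMIT 10;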

Data Collected

Metrics

Model Serving Metrics

databricks.model_serving.cpu_usage_percentage (gauge): Average CPU utilization across all replicas during the last minute. Shown as percent.
databricks.model_serving.gpu_mem_usage_percentage.avg (gauge): Average GPU memory usage across all GPUs during the last minute. Shown as percent.
databricks.model_serving.gpu_mem_usage_percentage.max (gauge): Maximum GPU memory usage across all GPUs during the last minute. Shown as percent.
databricks.model_serving.gpu_mem_usage_percentage.min (gauge): Minimum GPU memory usage across all GPUs during the last minute. Shown as percent.
databricks.model_serving.gpu_usage_percentage.avg (gauge): Average GPU utilization across all GPUs during the last minute. Shown as percent.
databricks.model_serving.gpu_usage_percentage.max (gauge): Maximum GPU utilization across all GPUs during the last minute. Shown as percent.
databricks.model_serving.gpu_usage_percentage.min (gauge): Minimum GPU utilization across all GPUs during the last minute. Shown as percent.
databricks.model_serving.mem_usage_percentage (gauge): Average memory utilization across all replicas during the last minute. Shown as percent.
databricks.model_serving.provisioned_concurrent_requests_total (gauge): Number of provisioned concurrent requests during the last minute. Shown as request.
databricks.model_serving.request_4xx_count_total (gauge): Number of 4xx errors during the last minute. Shown as request.
databricks.model_serving.request_5xx_count_total (gauge): Number of 5xx errors during the last minute. Shown as request.
databricks.model_serving.request_count_total (gauge): Number of requests during the last minute. Shown as request.
databricks.model_serving.request_latency_ms.75percentile (gauge): 75th percentile request latency in milliseconds during the last minute. Shown as millisecond.
databricks.model_serving.request_latency_ms.90percentile (gauge): 90th percentile request latency in milliseconds during the last minute. Shown as millisecond.
databricks.model_serving.request_latency_ms.95percentile (gauge): 95th percentile request latency in milliseconds during the last minute. Shown as millisecond.
databricks.model_serving.request_latency_ms.99percentile (gauge): 99th percentile request latency in milliseconds during the last minute. Shown as millisecond.
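
As an illustration, the request metrics above can be combined in a Datadog metric query to track an endpoint's server error rate. The endpoint_name tag below is a hypothetical scope; substitute whichever tags your metrics actually carry.

100 * sum:databricks.model_serving.request_5xx_count_total{endpoint_name:my-endpoint}
    / sum:databricks.model_serving.request_count_total{endpoint_name:my-endpoint}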

Spark Metrics

See the Spark integration documentation for a list of Spark metrics collected.

Service Checks

See the Spark integration documentation for the list of service checks collected.

Events

The Databricks integration does not include any events.

Troubleshooting

You can troubleshoot issues yourself by enabling the Databricks web terminal or by using a Databricks Notebook. Consult the Agent Troubleshooting documentation for information on useful troubleshooting steps.

Need help? Contact Datadog support.

Further Reading