Databricks (Zerobus) Destination
Join the Preview!
The Databricks (Zerobus) destination is in Preview. Contact your account manager to request access.
Overview
Use Observability Pipelines’ Databricks (Zerobus) destination to send logs to a Databricks Unity Catalog table. The destination streams logs to the Zerobus Ingest API and authenticates to Databricks with an OAuth service principal.
Prerequisites
Before you configure the Databricks (Zerobus) destination, you must:
Set up a schema and table
The SQL examples in this section use the following placeholders:
| Placeholder | Description | Example |
|---|---|---|
| <USER> | The user who creates the schema and table. | databricks-user@example.com |
| <CATALOG_NAME> | The Unity Catalog name. | main |
| <SCHEMA_NAME> | The schema name. | obs_pipelines |
| <TABLE_NAME> | The table name. | apache_common_logs |
| <YOUR_MANAGED_LOCATION> | (Optional) The managed location URI. | s3://your-bucket/managed |
Note: The GRANT commands must be run by a Databricks workspace admin.
In the Databricks workspace:
If you’re not a Databricks workspace admin, have an admin run the following command to grant your user permission to create a schema:
GRANT CREATE SCHEMA ON CATALOG <CATALOG_NAME> TO <USER>;
Create the schema:
CREATE SCHEMA IF NOT EXISTS <CATALOG_NAME>.<SCHEMA_NAME>
MANAGED LOCATION '<YOUR_MANAGED_LOCATION>';
Note: MANAGED LOCATION is optional. See Databricks’ Create Schemas documentation for more information.
If you’re not an admin user, have an admin run the following command to grant your user permission to create a table on the schema:
GRANT CREATE TABLE ON SCHEMA <CATALOG_NAME>.<SCHEMA_NAME> TO <USER>;
Run the following command to create the table that Observability Pipelines writes log data to:
CREATE TABLE <CATALOG_NAME>.<SCHEMA_NAME>.<TABLE_NAME> (
host STRING,
message STRING,
service STRING,
source_type STRING,
timestamp TIMESTAMP
);
The fully qualified table name is catalog.schema.table, for example main.obs_pipelines.apache_common_logs. This is the value you enter for Table Name when you set up the Observability Pipelines Databricks destination.
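As a small illustration of the catalog.schema.table format, the following Python sketch joins the three parts and rejects obviously malformed input. The function name `qualified_table_name` and the validation rule (no empty parts, no embedded dots) are assumptions made for this example and are not part of Observability Pipelines or Databricks.

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join the three parts into the catalog.schema.table form."""
    parts = [catalog, schema, table]
    for part in parts:
        # Assumption for this sketch: plain identifiers only, so an empty
        # part or an embedded dot is treated as a mistake.
        if not part or "." in part:
            raise ValueError(f"invalid identifier part: {part!r}")
    return ".".join(parts)


# Uses the example values from the placeholder table.
print(qualified_table_name("main", "obs_pipelines", "apache_common_logs"))
```

The printed value is the string you would enter in the Table Name field.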
Set up a service principal
The Databricks Zerobus Ingest API uses OAuth authentication. The OAuth client ID is the service principal’s UUID (its application ID), and you generate the OAuth client secret after you create the service principal.
To create a service principal:
- In your Databricks workspace, navigate to User Settings > Identity and access > Service principals.
- Click Add service principal.
- After the service principal is created, generate an OAuth secret for it.
- Take note of the service principal’s Application ID (client ID) and the OAuth client secret. You need both of them when you configure the Observability Pipelines Databricks destination.
- Run this SQL in Databricks to grant the service principal access to the catalog, schema, and table. Replace <SERVICE_PRINCIPAL_UUID> with the service principal’s application ID from the previous step:
GRANT USE CATALOG ON CATALOG <CATALOG_NAME> TO <SERVICE_PRINCIPAL_UUID>;
GRANT USE SCHEMA ON SCHEMA <CATALOG_NAME>.<SCHEMA_NAME> TO <SERVICE_PRINCIPAL_UUID>;
GRANT SELECT, MODIFY ON TABLE <CATALOG_NAME>.<SCHEMA_NAME>.<TABLE_NAME> TO <SERVICE_PRINCIPAL_UUID>;
See Databricks’ Add service principals to your account and Grant permissions on an object documentation for more information.
Setup
Configure the Databricks (Zerobus) destination when you set up a pipeline. You can set up a pipeline in the UI, using the API, or with Terraform. The steps in this section are configured in the UI.
Note: Log fields that are not present in the table schema are dropped. For example, if a log has the fields id, name, and host, and the table schema only contains the columns name and host, then the id field is dropped and not written to the table.
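The effect described in this note can be modeled with a few lines of Python. This is only a sketch of the documented outcome, not the Worker's implementation; the function name `keep_schema_fields` and the sample values are hypothetical.

```python
def keep_schema_fields(log: dict, table_columns: set) -> dict:
    """Return only the fields whose names match a table column."""
    return {name: value for name, value in log.items() if name in table_columns}


# A log with the fields id, name, and host, and a table with the
# columns name and host (sample values are made up).
log = {"id": 42, "name": "web-1", "host": "10.0.0.5"}
table_columns = {"name", "host"}

written = keep_schema_fields(log, table_columns)
print(written)  # the id field is absent from the result
```

If a field you need is missing from the table after ingestion, check that the table has a column with a matching name.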
After you select the Databricks (Zerobus) destination in the pipeline UI:
For Secrets Management: Only enter the identifier for the OAuth client secret. Do not enter the actual value.
If you enter secret identifiers and then choose to use environment variables, the environment variable name is the identifier you entered, prefixed with DD_OP_. For example, if you entered PASSWORD_1 for a password identifier, the environment variable for that password is DD_OP_PASSWORD_1.
- Enter the Ingestion Endpoint for your Databricks workspace, such as https://<workspace_id>.zerobus.<region>.cloud.databricks.com. The Worker sends logs to this endpoint.
- Enter the Table Name in the format catalog.schema.table, such as main.obs_pipelines.apache_common_logs.
- Enter the Unity Catalog Endpoint for your Databricks workspace, such as https://<workspace>.cloud.databricks.com. The Worker uses this endpoint to read the table’s schema.
- In the Auth - Client ID field, enter the application ID of the service principal, such as abcdefgh-1234-5678-abcd-ef0123456789.
- In the Auth - Client Secret field, enter the identifier for your OAuth client secret. If you leave it blank, the default is used.
Optional settings
Buffering
Toggle the switch to enable Buffering Options. Enable a configurable buffer on your destination to ensure intermittent latency or an outage at the destination doesn’t create immediate backpressure, and allow events to continue to be ingested from your source. Disk buffers can also increase pipeline durability by writing data to disk, ensuring buffered data persists through a Worker restart. See Destination buffers for more information.
- If left unconfigured, your destination uses a memory buffer with a capacity of 500 events.
- To configure a buffer on your destination:
  - Select the buffer type you want to set (Memory or Disk).
  - Enter the buffer size and select the unit.
    - Maximum memory buffer size is 128 GB.
    - Maximum disk buffer size is 500 GB.
  - In the Behavior on full buffer dropdown menu, select whether you want to block events or drop new events when the buffer is full.
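To make the two full-buffer behaviors concrete, here is a minimal Python model of a bounded, event-count buffer. It is a conceptual sketch, not how the Worker is implemented: the class `BoundedBuffer`, the mode names `block` and `drop_newest`, and the returned status strings are all hypothetical. The default capacity of 500 events mirrors the unconfigured memory buffer described above.

```python
from collections import deque


class BoundedBuffer:
    """Toy buffer that either signals backpressure or drops new events when full."""

    def __init__(self, max_events: int = 500, when_full: str = "block"):
        if when_full not in ("block", "drop_newest"):
            raise ValueError("when_full must be 'block' or 'drop_newest'")
        self.max_events = max_events
        self.when_full = when_full
        self._events = deque()

    def offer(self, event) -> str:
        """Try to enqueue an event and report what happened to it."""
        if len(self._events) < self.max_events:
            self._events.append(event)
            return "accepted"
        # Buffer is full: "blocked" means the caller has to wait and retry
        # (backpressure); "dropped" means the new event is discarded.
        return "dropped" if self.when_full == "drop_newest" else "blocked"

    def take(self):
        """Remove and return the oldest buffered event."""
        return self._events.popleft()


dropping = BoundedBuffer(max_events=2, when_full="drop_newest")
print([dropping.offer(e) for e in ("a", "b", "c")])

blocking = BoundedBuffer(max_events=2, when_full="block")
print([blocking.offer(e) for e in ("a", "b", "c")])
```

Blocking preserves every event at the cost of slowing the source, while dropping keeps the source moving at the cost of losing the newest events.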
If your logs have timestamps in string format and your Databricks table has a column declared as the TIMESTAMP type, you must convert the strings to timestamp format before sending logs to the Databricks (Zerobus) destination. The destination can only convert values that are already in timestamp format to the Databricks TIMESTAMP type.
If you do not convert the string timestamp, the Worker throws an error similar to:
Protobuf encoding failed: Error converting timestamp field: Can't convert '2012-04-23T10[41]15Z' to i64: invalid digit found in string
To convert timestamps in string format to timestamp format:
- Add a Custom Processor to your pipeline.
- Add a function with the following custom script:
.timestamp = parse_timestamp!(.timestamp, format: "%+")
See parse_timestamp for more information.
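The error above mentions an i64, which suggests that timestamps are encoded as 64-bit integers. As a rough illustration of what such a conversion involves, the following Python sketch parses an ISO 8601 string and converts it to an integer count of microseconds since the Unix epoch. The microsecond unit, the function name `to_epoch_micros`, and the sample string are assumptions for this example; inside a pipeline, use the custom script shown above instead.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_micros(value: str) -> int:
    """Parse an ISO 8601 UTC-offset timestamp string into epoch microseconds."""
    # %z accepts "Z" as well as numeric offsets such as "+00:00".
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    # Integer division of timedeltas avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(microseconds=1)


# Hypothetical sample value, not taken from a real log.
print(to_epoch_micros("2012-04-23T10:41:15Z"))
```

A string that does not match the expected format raises a `ValueError`, which is comparable to the encoding failure the Worker reports for unconverted string timestamps.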
Secret defaults
These are the defaults used for secret identifiers and environment variables.
- Databricks OAuth client secret identifier:
  - References the OAuth client secret for the service principal the Observability Pipelines Worker uses to authenticate to Databricks.
  - The default identifier is DESTINATION_DATABRICKS_ZEROBUS_OAUTH_CLIENT_SECRET.
- Databricks OAuth client secret:
  - The OAuth client secret for the service principal the Observability Pipelines Worker uses to authenticate to Databricks.
  - The default environment variable is DD_OP_DESTINATION_DATABRICKS_ZEROBUS_OAUTH_CLIENT_SECRET.
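The relationship between a secret identifier and its environment variable can be sketched in Python as follows. This could serve as a preflight check before starting a Worker, but it is only an illustration: the helper names `env_var_for` and `is_secret_set` are hypothetical, and the sketch assumes the DD_OP_ prefix rule and the default identifier listed above.

```python
import os

ENV_PREFIX = "DD_OP_"
DEFAULT_IDENTIFIER = "DESTINATION_DATABRICKS_ZEROBUS_OAUTH_CLIENT_SECRET"


def env_var_for(identifier: str = "") -> str:
    """Return the environment variable name for a secret identifier.

    An empty identifier falls back to the documented default.
    """
    return ENV_PREFIX + (identifier or DEFAULT_IDENTIFIER)


def is_secret_set(identifier: str = "", environ=os.environ) -> bool:
    """Report whether the matching environment variable has a non-empty value."""
    return bool(environ.get(env_var_for(identifier)))


print(env_var_for("PASSWORD_1"))
print(env_var_for())
```

Passing a plain dictionary as `environ` makes the check easy to exercise without touching the real environment.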
How the destination works
Event batching
A batch of events is flushed when any one of these limits is reached. See event batching for more information.
| Maximum Events | Maximum Size (MB) | Timeout (seconds) |
|---|---|---|
| None | 10 | 1 |
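The flush conditions in the table can be modeled with a short Python sketch: a batch is flushed when its size reaches the maximum or when the timeout elapses, with no limit on the event count. This is a conceptual model rather than the Worker's code. The class `Batcher`, its method names, and the interpretation of 10 MB as 10,000,000 bytes are assumptions.

```python
class Batcher:
    """Toy batcher that flushes on a size limit or a timeout, whichever comes first."""

    def __init__(self, max_bytes: int = 10_000_000, timeout_s: float = 1.0):
        self.max_bytes = max_bytes
        self.timeout_s = timeout_s
        self._events = []
        self._size = 0
        self._started = None  # time the first event of the current batch arrived

    def add(self, payload: bytes, now: float):
        """Add an event; return the flushed batch if the size limit is reached."""
        if not self._events:
            self._started = now
        self._events.append(payload)
        self._size += len(payload)
        if self._size >= self.max_bytes:
            return self._flush()
        return None

    def tick(self, now: float):
        """Return the flushed batch if the timeout has elapsed, else None."""
        if self._events and now - self._started >= self.timeout_s:
            return self._flush()
        return None

    def _flush(self):
        batch, self._events, self._size, self._started = self._events, [], 0, None
        return batch


# Small limits so the behavior is visible without large payloads.
batcher = Batcher(max_bytes=10, timeout_s=1.0)
print(batcher.add(b"aaaa", now=0.0))    # below the size limit, nothing flushed
print(batcher.add(b"bbbbbb", now=0.2))  # size limit reached, batch flushed
print(batcher.add(b"cc", now=0.5))      # new batch started
print(batcher.tick(now=1.5))            # timeout elapsed, batch flushed
```

Time is passed in explicitly so the timeout path can be demonstrated deterministically.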