Data Jobs Monitoring

Join the Beta!

Data Jobs Monitoring is in private beta. Fill out this form to join the wait list.

Datadog Data Jobs Monitoring

Data Jobs Monitoring provides visibility into the performance and reliability of data processing jobs, including Apache Spark and Databricks jobs, along with the underlying infrastructure. Data Jobs Monitoring enables you to:

  • Track the health and performance of data processing jobs across your accounts and workspaces. See which jobs consume the most compute resources or run inefficiently.
  • Receive an alert when a job fails or takes too long to complete.
  • Analyze job execution details and stack traces.
  • Correlate infrastructure metrics, Spark metrics from the Spark UI, logs, and cluster configuration.
  • Compare multiple runs to facilitate troubleshooting, and to optimize provisioning and configuration during deployment.

Setup

Data Jobs Monitoring is supported for Amazon EMR, Databricks (AWS, Azure, Google Cloud), and Spark on Kubernetes.

To get started, select your platform and follow the installation instructions:

Amazon EMR
Databricks
Kubernetes

Explore Data Jobs Monitoring

Easily identify unreliable and inefficient jobs

View all jobs across cloud accounts and workspaces. Identify failing jobs to take action on, or find jobs with high idle CPU that are using a lot of compute and should be optimized.
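As a rough illustration of what "high idle CPU" means when ranking jobs, the sketch below computes an idle fraction per job and sorts by wasted compute. The `JobUsage` record, its field names, and the 50% threshold are assumptions made for this example; they are not part of the Datadog product or API.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    """Hypothetical per-job record; field names are illustrative only."""
    name: str
    allocated_core_seconds: float  # CPU time reserved for the job
    used_core_seconds: float       # CPU time the job actually consumed

    @property
    def idle_core_seconds(self) -> float:
        return max(self.allocated_core_seconds - self.used_core_seconds, 0.0)

    @property
    def idle_fraction(self) -> float:
        if self.allocated_core_seconds <= 0:
            return 0.0
        return self.idle_core_seconds / self.allocated_core_seconds


def rank_by_idle_compute(jobs, min_idle_fraction=0.5):
    """Return jobs at or above the idle threshold, most wasted compute first."""
    candidates = [j for j in jobs if j.idle_fraction >= min_idle_fraction]
    return sorted(candidates, key=lambda j: j.idle_core_seconds, reverse=True)


jobs = [
    JobUsage("nightly_etl", 10_000.0, 2_000.0),
    JobUsage("hourly_agg", 500.0, 450.0),
    JobUsage("backfill", 40_000.0, 10_000.0),
]
ranked = rank_by_idle_compute(jobs)
for job in ranked:
    print(f"{job.name}: {job.idle_fraction:.0%} idle, {job.idle_core_seconds:.0f} idle core-seconds")
```

Sorting by absolute idle core-seconds rather than by idle fraction puts large, wasteful jobs ahead of small ones that are idle by a similar percentage.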

Job list - CPU breakdown

Receive alerts on problematic jobs

Datadog monitors send alerts when a job fails or runs beyond its expected completion time.
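The two alert conditions described above can be sketched as a small classification function. This is only an illustration of the logic, not how Datadog monitors are implemented or configured; the function name, its parameters, and the status strings are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_run(
    started_at: datetime,
    finished_at: Optional[datetime],
    failed: bool,
    max_duration: timedelta,
    now: datetime,
) -> str:
    """Return "failed", "overrunning", or "ok" for a single job run."""
    if failed:
        return "failed"
    # A run that has not finished yet is measured up to the current time.
    end = finished_at if finished_at is not None else now
    if end - started_at > max_duration:
        return "overrunning"
    return "ok"


start = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 4, 30, tzinfo=timezone.utc)
limit = timedelta(hours=2)

still_running = classify_run(start, None, False, limit, now)
finished_in_time = classify_run(start, start + timedelta(hours=1), False, limit, now)
crashed = classify_run(start, start + timedelta(minutes=5), True, limit, now)
```

Here the still-running job has been active for 2.5 hours against a 2-hour limit, so it is classified as overrunning before it finishes.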

Monitors - templates for Data Jobs Monitoring

Analyze and troubleshoot individual jobs

Click on a job to see how it performed across multiple runs, as well as error messages for failed runs.

Analyze an individual run

Clicking on a run opens a side panel with details of how much time was spent on each Spark job and stage, along with a breakdown of resource consumption and Spark metrics, such as idle executor CPU, input/output data volume, shuffling, and disk spill. From this panel, you can correlate the execution with executor and driver node resource utilization, logs, and the job and cluster configuration.

Data Jobs Monitoring > Run panel, Info tab

On the Infrastructure tab, you can correlate the execution with infrastructure metrics.

Data Jobs Monitoring > Run panel, Infrastructure tab

To determine why a stage is taking a long time to complete, use the Spark Task Metrics tab to view task-level metrics for a specific Spark stage and identify data skew. The tab shows the distribution of time spent and data consumed across tasks.
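To make the idea of data skew concrete, the sketch below flags tasks whose duration is far above the stage median, which is one common way to read a task-level distribution. The function names and the 3x threshold are assumptions for illustration; they do not describe how the Spark Task Metrics tab computes anything.

```python
from statistics import median


def skew_ratio(task_durations):
    """Ratio of the slowest task to the median task; near 1.0 means balanced."""
    if not task_durations:
        raise ValueError("task_durations must not be empty")
    med = median(task_durations)
    slowest = max(task_durations)
    if med == 0:
        return float("inf") if slowest > 0 else 1.0
    return slowest / med


def skewed_tasks(task_durations, factor=3.0):
    """Indices of tasks that ran longer than `factor` times the median."""
    med = median(task_durations)
    return [i for i, d in enumerate(task_durations) if d > factor * med]


# Hypothetical task durations in seconds for one Spark stage.
durations = [10, 12, 11, 9, 95, 10]
ratio = skew_ratio(durations)
stragglers = skewed_tasks(durations)
print(f"skew ratio: {ratio:.1f}, straggler task indices: {stragglers}")
```

The same comparison can be applied to bytes read per task instead of duration: a task that processes far more data than the median usually points to an uneven partition key.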

Data Jobs Monitoring > Run panel, Spark Task Metrics tab

For a failed run, look at the Errors tab to see the stack trace, which can help you determine where and how this failure occurred.

Data Jobs Monitoring > Run panel, Errors tab
