Data Jobs Monitoring

Data Jobs Monitoring provides visibility into the performance, reliability, and cost efficiency of your data processing jobs, along with the underlying infrastructure. Data Jobs Monitoring enables you to:

Track the health and performance of data processing jobs across your accounts and workspaces. See which jobs consume the most compute resources or run inefficiently.
Receive an alert when a job fails or takes too long to complete.
Analyze job execution details and stack traces.
Correlate infrastructure metrics, Spark metrics from the Spark UI, logs, and cluster configuration.
Compare multiple runs to facilitate troubleshooting, and to optimize provisioning and configuration during deployment.

Setup

Data Jobs Monitoring supports the monitoring of jobs on Amazon EMR, Databricks (AWS, Azure, Google Cloud), Google Dataproc, Spark on Kubernetes, and Apache Airflow.

To get started, select your platform and follow its installation instructions.

Explore Data Jobs Monitoring

Easily identify unreliable and inefficient jobs

View all jobs across cloud accounts and workspaces. Identify failing jobs that need action, or find jobs that use a lot of compute while leaving CPU idle, which makes them candidates for optimization.
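As a rough illustration of the idea above, the following sketch ranks jobs by wasted compute. It is not a Datadog API: the `JobUsage` record, its field names, and the definition of idle compute as core-hours multiplied by (1 - CPU utilization) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobUsage:
    """Hypothetical per-job usage summary (not a Datadog data model)."""
    name: str
    core_hours: float        # total provisioned core-hours for the job
    cpu_utilization: float   # average executor CPU utilization, 0.0 to 1.0


def idle_core_hours(job: JobUsage) -> float:
    """Assumed definition: provisioned core-hours that did no useful work."""
    return job.core_hours * (1.0 - job.cpu_utilization)


def rank_by_idle_compute(jobs: List[JobUsage]) -> List[str]:
    """Return job names ordered from most to least idle core-hours."""
    ordered = sorted(jobs, key=idle_core_hours, reverse=True)
    return [job.name for job in ordered]


# Hypothetical example data.
jobs = [
    JobUsage("etl-a", core_hours=100.0, cpu_utilization=0.2),
    JobUsage("etl-b", core_hours=40.0, cpu_utilization=0.9),
    JobUsage("etl-c", core_hours=60.0, cpu_utilization=0.5),
]
ranking = rank_by_idle_compute(jobs)
```

Ranking by absolute idle core-hours rather than by idle percentage puts large, moderately wasteful jobs ahead of small, very wasteful ones, which is usually where optimization pays off most.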

Receive alerts on problematic jobs

Datadog monitors send alerts when a job fails or runs beyond its expected completion time. Browse the monitor templates for your installed integrations to start monitoring data jobs.
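The two alert conditions described above can be sketched in a few lines. This is an illustration of the logic only, not how Datadog monitors are configured or evaluated; the `JobRun` record, its status values, and the `max_duration_s` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRun:
    """Hypothetical snapshot of a single job run."""
    job_name: str
    status: str        # assumed values: "running", "succeeded", "failed"
    duration_s: float  # elapsed time so far, in seconds


def alert_reason(run: JobRun, max_duration_s: float) -> Optional[str]:
    """Return why a run should alert, or None if it looks healthy."""
    if run.status == "failed":
        return "failed"
    # A run that is still going past its expected completion time.
    if run.status == "running" and run.duration_s > max_duration_s:
        return "overrunning"
    return None


# Hypothetical runs checked against a one-hour expected completion time.
failed = alert_reason(JobRun("nightly-etl", "failed", 120.0), 3600.0)
slow = alert_reason(JobRun("nightly-etl", "running", 5400.0), 3600.0)
healthy = alert_reason(JobRun("nightly-etl", "running", 600.0), 3600.0)
```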

Analyze and troubleshoot individual jobs

Click on a job to see how it performed across multiple runs, as well as error messages for failed runs.

Job Overview page for 'product-insights' Spark Application job

Analyze an individual run

Clicking on a run opens a side panel with details of how much time was spent on each Spark job and stage, along with a breakdown of resource consumption and Spark metrics, such as idle executor CPU, input/output data volume, shuffling, and disk spill. From this panel, you can correlate the execution with executor and driver node resource utilization, logs, and the job and cluster configuration.

On the Infrastructure tab, you can correlate the execution with infrastructure metrics.

Data Jobs Monitoring > Run panel, Infrastructure tab

For a failed run, look at the Errors tab to see the stack trace, which can help you determine where and how the failure occurred.

To determine why a stage is taking a long time to complete, use the Spark Task Metrics tab to view task-level metrics for that stage. The tab shows the distribution of time spent and data consumed across tasks, which helps you identify data skew.
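To make the notion of data skew concrete, here is a minimal sketch that flags a skewed stage from its per-task durations. The metric (slowest task divided by the median task) is one common heuristic chosen for this example, not necessarily what the Spark Task Metrics tab computes, and the sample durations are invented.

```python
from statistics import median
from typing import Sequence


def skew_ratio(task_durations_s: Sequence[float]) -> float:
    """Slowest task duration divided by the median task duration.

    Values close to 1 suggest evenly sized tasks; values well above 1
    suggest that a few tasks process far more data than the rest.
    """
    if not task_durations_s:
        raise ValueError("need at least one task duration")
    slowest = max(task_durations_s)
    typical = median(task_durations_s)
    if typical == 0:
        # Avoid dividing by zero when most tasks finish instantly.
        return float("inf") if slowest > 0 else 1.0
    return slowest / typical


# Hypothetical per-task durations (seconds) for two Spark stages.
balanced_stage = [10.0, 11.0, 9.0, 10.0]
skewed_stage = [10.0, 11.0, 9.0, 95.0]

balanced_ratio = skew_ratio(balanced_stage)
skewed_ratio = skew_ratio(skewed_stage)
```

The same ratio can be applied to per-task input bytes or shuffle read bytes. A stage where both time and data volume show a high ratio is a likely candidate for repartitioning or key salting.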