Enable Data Jobs Monitoring for Apache Airflow

Data Jobs Monitoring for Apache Airflow is in Preview

To try the preview for Airflow monitoring, follow the setup instructions below.

Data Jobs Monitoring provides visibility into the performance and reliability of workflows run by Apache Airflow DAGs.

Requirements

Apache Airflow 2.5.0 or later
apache-airflow-providers-openlineage or openlineage-airflow depending on your Airflow version

Setup

To get started, follow the instructions below.

Install the openlineage provider on both Airflow schedulers and Airflow workers by adding the following to your requirements.txt file or wherever your Airflow dependencies are managed:
For Airflow 2.7 or later:
```
apache-airflow-providers-openlineage
```
For Airflow 2.5 and 2.6:
```
openlineage-airflow
```
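The version rule above can be expressed as a small helper. This is only an illustrative sketch: the function name `openlineage_package` is hypothetical, and it assumes the Airflow version is available as a dotted string such as "2.7.3".

```python
def openlineage_package(airflow_version: str) -> str:
    """Return the OpenLineage package to install for an Airflow version string."""
    # Compare only the major and minor components, e.g. "2.7.3" -> (2, 7).
    major, minor = (int(part) for part in airflow_version.split(".")[:2])
    if (major, minor) < (2, 5):
        raise ValueError("Data Jobs Monitoring requires Apache Airflow 2.5.0 or later")
    if (major, minor) >= (2, 7):
        return "apache-airflow-providers-openlineage"
    # Airflow 2.5 and 2.6 use the standalone integration package.
    return "openlineage-airflow"
```

A helper like this could be used in a build script to write the right line into requirements.txt.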
Configure the openlineage provider. The simplest option is to set the following environment variables and make them available to the pods that run your Airflow schedulers and Airflow workers:
```
export OPENLINEAGE_URL=<DD_DATA_OBSERVABILITY_INTAKE>
export OPENLINEAGE_API_KEY=<DD_API_KEY>
# OPENLINEAGE_NAMESPACE sets the 'env' tag value in Datadog. You can hardcode this to a different value
export OPENLINEAGE_NAMESPACE=${AIRFLOW_ENV_NAME}
```
- Replace <DD_DATA_OBSERVABILITY_INTAKE> with https://data-obs-intake..
- Replace <DD_API_KEY> with your valid Datadog API key.
- If you’re using Airflow v2.7 or v2.8, also add these two environment variables along with the previous ones. They work around an OpenLineage config issue that was fixed in apache-airflow-providers-openlineage v1.7; Airflow v2.7 and v2.8 use earlier versions of the provider.
```
#!/bin/sh
# Required for Airflow v2.7 & v2.8 only
export AIRFLOW__OPENLINEAGE__CONFIG_PATH=""
export AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""
```
See the official configuration-openlineage documentation for other supported configurations of the openlineage provider.
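Before rolling out the pods, it can help to confirm that the variables described above are actually present. The following is a minimal sketch, not part of the provider: the function `missing_openlineage_vars` and its `airflow_minor` parameter are hypothetical names chosen for illustration.

```python
import os
from typing import List, Mapping

REQUIRED_VARS = ("OPENLINEAGE_URL", "OPENLINEAGE_API_KEY")
# Per the note above, these are also needed on Airflow 2.7 and 2.8,
# where they are set to empty strings.
EXTRA_VARS_27_28 = (
    "AIRFLOW__OPENLINEAGE__CONFIG_PATH",
    "AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS",
)


def missing_openlineage_vars(env: Mapping[str, str], airflow_minor: str) -> List[str]:
    """List the expected variables that are absent from env."""
    # The URL and API key must be non-empty.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if airflow_minor in ("2.7", "2.8"):
        # An empty string is the intended value here, so only test presence.
        missing += [name for name in EXTRA_VARS_27_28 if name not in env]
    return missing


if __name__ == "__main__":
    # "2.8" is an example value; pass the minor version you actually run.
    print(missing_openlineage_vars(os.environ, airflow_minor="2.8"))
```

An empty list means every expected variable was found in the environment the script ran in.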
Trigger an update to your Airflow pods and wait for them to finish.
Optionally, set up log collection for correlating task logs to DAG run executions in Data Jobs Monitoring. Correlation requires the logs directory to follow the default log filename format.
The PATH_TO_AIRFLOW_LOGS value is $AIRFLOW_HOME/logs in standard deployments, but may differ if customized. Add the following annotation to your pod:
```
ad.datadoghq.com/base.logs: '[{"type": "file", "path": "PATH_TO_AIRFLOW_LOGS/*/*/*/*.log", "source": "airflow"}]'
```
Adding "source": "airflow" enables the extraction of the correlation-required attributes by the Airflow integration logs pipeline.
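If your logs directory is customized, the annotation value can be generated rather than edited by hand. This sketch assumes a hypothetical helper name, `airflow_logs_annotation`, and uses `/opt/airflow/logs` purely as an example path; substitute your own PATH_TO_AIRFLOW_LOGS.

```python
import json


def airflow_logs_annotation(logs_dir: str) -> str:
    """Build the JSON value for the ad.datadoghq.com/base.logs annotation."""
    # Strip a trailing slash so the glob does not contain "//".
    path = logs_dir.rstrip("/") + "/*/*/*/*.log"
    # "source": "airflow" is what routes the logs to the Airflow logs pipeline.
    return json.dumps([{"type": "file", "path": path, "source": "airflow"}])


if __name__ == "__main__":
    # Example path only; replace with your PATH_TO_AIRFLOW_LOGS.
    print(airflow_logs_annotation("/opt/airflow/logs"))
```

The returned string is what goes between the single quotes in the annotation shown above.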
Note: Log collection requires the Datadog Agent to already be installed on your Kubernetes cluster. If you haven’t installed it yet, see the Kubernetes installation documentation.
For more methods to set up log collection on Kubernetes, see the Kubernetes and Integrations configuration section.

Validation

In Datadog, view the Data Jobs Monitoring page to see a list of your Airflow job runs after the setup.

Troubleshooting

Set OPENLINEAGE_CLIENT_LOGGING to DEBUG, alongside the environment variables set previously, so that the OpenLineage client and its child modules log at the DEBUG level. This can be useful for troubleshooting while you configure the openlineage provider.

Requirements

Apache Airflow 2.5.0 or later
apache-airflow-providers-openlineage or openlineage-airflow depending on your Airflow version

Setup

To get started, follow the instructions below.

Install the openlineage provider by adding the following to your requirements.txt file:
For Airflow 2.7 or later:
```
apache-airflow-providers-openlineage
```
For Airflow 2.5 and 2.6:
```
openlineage-airflow
```
Configure the openlineage provider. The simplest option is to set the following environment variables in your Amazon MWAA startup script:
```
#!/bin/sh
export OPENLINEAGE_URL=<DD_DATA_OBSERVABILITY_INTAKE>
export OPENLINEAGE_API_KEY=<DD_API_KEY>
# AIRFLOW__OPENLINEAGE__NAMESPACE sets the 'env' tag value in Datadog. You can hardcode this to a different value
export AIRFLOW__OPENLINEAGE__NAMESPACE=${AIRFLOW_ENV_NAME}
```
- Replace <DD_DATA_OBSERVABILITY_INTAKE> fully with https://data-obs-intake..
- Replace <DD_API_KEY> fully with your valid Datadog API key.
- If you’re using Airflow v2.7 or v2.8, also add these two environment variables to the startup script. They work around an OpenLineage config issue that was fixed in apache-airflow-providers-openlineage v1.7; Airflow v2.7 and v2.8 use earlier versions of the provider.
```
#!/bin/sh
# Required for Airflow v2.7 & v2.8 only
export AIRFLOW__OPENLINEAGE__CONFIG_PATH=""
export AIRFLOW__OPENLINEAGE__DISABLED_FOR_OPERATORS=""
```
See the official configuration-openlineage documentation for other supported configurations of the openlineage provider.
Deploy your updated requirements.txt and Amazon MWAA startup script to the Amazon S3 folder configured for your Amazon MWAA environment.
Optionally, set up log collection for correlating task logs to DAG run executions in Data Jobs Monitoring:
1. Configure Amazon MWAA to send logs to CloudWatch.
2. Send the logs to Datadog.

Validation

In Datadog, view the Data Jobs Monitoring page to see a list of your Airflow job runs after the setup.

Troubleshooting

Ensure that the execution role configured for your Amazon MWAA environment has the right permissions for the requirements.txt file and the Amazon MWAA startup script. This is required if you manage your own execution role and are adding those supporting files for the first time. See the official Amazon MWAA execution role guide for details.

Set OPENLINEAGE_CLIENT_LOGGING to DEBUG in the Amazon MWAA startup script so that the OpenLineage client and its child modules log at the DEBUG level. This can be useful for troubleshooting while you configure the openlineage provider.

For Astronomer customers using Astro: Astro offers lineage features that rely on the Airflow OpenLineage provider. Data Jobs Monitoring depends on the same OpenLineage provider and uses the Composite transport to add a Datadog transport alongside the existing one.

Requirements

Setup

To set up the OpenLineage provider, define the following environment variables. You can configure these variables in your Astronomer deployment using either of the following methods:
- From the Astro UI: Navigate to your deployment settings and add the environment variables directly.
- In the Dockerfile: Define the environment variables in your Dockerfile to ensure they are included during the build process.
```
OPENLINEAGE__TRANSPORT__TYPE=composite
OPENLINEAGE__TRANSPORT__TRANSPORTS__DATADOG__TYPE=http
OPENLINEAGE__TRANSPORT__TRANSPORTS__DATADOG__URL=<DD_DATA_OBSERVABILITY_INTAKE>
OPENLINEAGE__TRANSPORT__TRANSPORTS__DATADOG__AUTH__TYPE=api_key
OPENLINEAGE__TRANSPORT__TRANSPORTS__DATADOG__AUTH__API_KEY=<DD_API_KEY>
OPENLINEAGE__TRANSPORT__TRANSPORTS__DATADOG__COMPRESSION=gzip
```
- Replace <DD_DATA_OBSERVABILITY_INTAKE> with https://data-obs-intake..
- Replace <DD_API_KEY> with your valid Datadog API key.
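The variable names above share a common prefix, so they can be generated from the two values you supply. This is a sketch under stated assumptions: `datadog_transport_env` and `to_env_lines` are hypothetical helper names, and the output simply mirrors the variable list shown above.

```python
from typing import Dict, List


def datadog_transport_env(intake_url: str, api_key: str) -> Dict[str, str]:
    """Return the composite-transport variables listed above as a dict."""
    prefix = "OPENLINEAGE__TRANSPORT__TRANSPORTS__DATADOG__"
    return {
        "OPENLINEAGE__TRANSPORT__TYPE": "composite",
        prefix + "TYPE": "http",
        prefix + "URL": intake_url,
        prefix + "AUTH__TYPE": "api_key",
        prefix + "AUTH__API_KEY": api_key,
        prefix + "COMPRESSION": "gzip",
    }


def to_env_lines(env: Dict[str, str]) -> List[str]:
    """Format the variables as KEY=value lines."""
    return [f"{key}={value}" for key, value in env.items()]
```

The dict form is convenient for scripting against a deployment, while the KEY=value lines match the listing above.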
Optional:
- Set AIRFLOW__OPENLINEAGE__NAMESPACE with a unique name for the env tag on all DAGs in the Airflow deployment. This allows Datadog to logically separate this deployment’s jobs from those of other Airflow deployments.
- Set OPENLINEAGE_CLIENT_LOGGING to DEBUG for the OpenLineage client and its child modules to log at a DEBUG logging level. This can be useful for troubleshooting during the configuration of an OpenLineage provider.
See the Astronomer official guide for managing environment variables for a deployment. See Apache Airflow’s OpenLineage Configuration Reference for other supported configurations of the OpenLineage provider.
Trigger an update to your deployment and wait for it to finish.

Validation

In Datadog, view the Data Jobs Monitoring page to see a list of your Airflow job runs after the setup.

Troubleshooting

Check that the OpenLineage environment variables are correctly set on the Astronomer deployment.

Note: Using the .env file to add the environment variables does not work because the variables are only applied to the local Airflow environment.

Data Jobs Monitoring for Airflow is not yet compatible with Dataplex data lineage. Setting up OpenLineage for Data Jobs Monitoring overrides your existing Dataplex transport configuration.

Requirements

Cloud Composer 2 or later
apache-airflow-providers-openlineage

Setup

To get started, follow the instructions below.

In the Advanced Configuration tab, under Airflow configuration override, click Add Airflow configuration override and configure these settings:
- In Section 1, enter openlineage.
- In Key 1, enter transport.
- In Value 1, enter the following:
```
{
 "type": "http", 
 "url": "<DD_DATA_OBSERVABILITY_INTAKE>", 
 "auth": {
    "type": "api_key", 
    "api_key": "<DD_API_KEY>"
 }
}
```
- Replace <DD_DATA_OBSERVABILITY_INTAKE> fully with https://data-obs-intake..
- Replace <DD_API_KEY> fully with your valid Datadog API key.
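Because the override value is raw JSON pasted into a form field, a quick script can help avoid syntax slips and unsubstituted placeholders. This is a minimal sketch; the function name `composer_transport_value` is hypothetical, and the structure simply mirrors the JSON shown above.

```python
import json


def composer_transport_value(intake_url: str, api_key: str) -> str:
    """Build the JSON string for the openlineage / transport override."""
    for label, value in (("intake_url", intake_url), ("api_key", api_key)):
        # Catch placeholders such as <DD_API_KEY> that were never substituted.
        if not value or (value.startswith("<") and value.endswith(">")):
            raise ValueError(f"{label} is empty or still a placeholder")
    transport = {
        "type": "http",
        "url": intake_url,
        "auth": {"type": "api_key", "api_key": api_key},
    }
    return json.dumps(transport, indent=1)
```

The returned string can be pasted into the Value 1 field.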
See the official Airflow and Composer documentation pages for other supported configurations of the openlineage provider in Google Cloud Composer.
After starting the Composer environment, install the openlineage provider by adding the following package in the PyPI packages tab of your environment page:
```
apache-airflow-providers-openlineage
```

Validation

In Datadog, view the Data Jobs Monitoring page to see a list of your Airflow job runs after the setup.

Troubleshooting

Set OPENLINEAGE_CLIENT_LOGGING to DEBUG in the Environment variables tab of the Composer page so that the OpenLineage client and its child modules log at the DEBUG level. This can be useful for troubleshooting as you configure the openlineage provider.

Advanced Configuration

Link your dbt jobs with Airflow tasks

You can monitor dbt jobs that run in Airflow by connecting the dbt telemetry with the respective Airflow tasks, using the OpenLineage dbt integration.

To see the link between Airflow tasks and dbt jobs, follow these steps:

Install openlineage-dbt. See Using dbt with Amazon MWAA to set up dbt in the virtual environment.

pip3 install "openlineage-dbt>=1.36.0"

Change the dbt invocation to dbt-ol (OpenLineage wrapper for dbt).

Also, add the --consume-structured-logs flag to view dbt jobs while the command is still running.

dbt-ol run --consume-structured-logs --project-dir=$TEMP_DIR --profiles-dir=$PROFILES_DIR

In your DAG file, add the OPENLINEAGE_PARENT_ID variable to the environment of the Airflow task that runs the dbt process:

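from airflow.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id="dbt_run",
    dag=dag,
    # Run dbt through the OpenLineage wrapper so lineage events are emitted.
    bash_command="dbt-ol run --consume-structured-logs --project-dir=$TEMP_DIR --profiles-dir=$PROFILES_DIR",
    # Keep the inherited environment and add the variable below to it.
    append_env=True,
    env={
        # Links the dbt run to the Airflow task that launched it.
        "OPENLINEAGE_PARENT_ID": "{{ macros.OpenLineageProviderPlugin.lineage_parent_id(task_instance) }}",
    },
)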

Link your Spark jobs with Airflow tasks

OpenLineage integration can automatically inject Airflow’s parent job information (namespace, job name, run id) into Spark application properties. This creates a parent-child relationship between Airflow tasks and Spark jobs, enabling you to troubleshoot both systems in one place.

Make sure your Spark jobs are currently monitored through Data Jobs Monitoring.
Enable automatic parent job information injection by setting the following configuration:

AIRFLOW__OPENLINEAGE__SPARK_INJECT_PARENT_JOB_INFO=true

This automatically injects parent job properties for all supported Spark Operators, like SparkSubmitOperator or LivyOperator. See the Apache Airflow OpenLineage documentation for the full list of supported operators. To disable for specific operators, set openlineage_inject_parent_job_info=False on the operator.