Google Cloud Composer

Overview

Google Cloud Composer is a fully managed workflow orchestration service for authoring, scheduling, and monitoring pipelines that span clouds and on-premises data centers.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud Composer.

Setup

Installation

If you haven’t already, set up the Google Cloud Platform integration first. There are no other installation steps.

Log collection

Google Cloud Composer logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven’t already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud Composer logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the Google Cloud Logging page and filter the Google Cloud Composer logs.
  2. Click Create Export and name the sink.
  3. Choose “Cloud Pub/Sub” as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
  4. Click Create and wait for the confirmation message to show up.
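
The export steps above can also be scripted. The following is a minimal sketch, not an official procedure: it assumes the `gcloud logging sinks create` command and that Composer logs carry the `cloud_composer_environment` resource type in Cloud Logging. The sink, project, and topic names are hypothetical placeholders.

```python
import shlex

# Assumption: Composer environment logs use this Cloud Logging resource type.
COMPOSER_LOG_FILTER = 'resource.type="cloud_composer_environment"'


def pubsub_destination(project_id: str, topic_id: str) -> str:
    """Build the sink destination string for a Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


def sink_create_command(sink_name: str, project_id: str, topic_id: str) -> str:
    """Return a shell-quoted gcloud command that creates the log sink."""
    args = [
        "gcloud", "logging", "sinks", "create",
        sink_name,
        pubsub_destination(project_id, topic_id),
        f"--log-filter={COMPOSER_LOG_FILTER}",
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Hypothetical names; replace with your own sink, project, and topic.
    print(sink_create_command("composer-to-datadog", "my-project", "export-logs-to-datadog"))
```

Because the destination string includes the topic's own project ID, the same helper covers the case where the Pub/Sub topic lives in a different project from the one that emits the logs.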

Data Collected

Metrics

gcp.composer.environment.active_schedulers
(gauge)
Number of active scheduler instances.
gcp.composer.environment.active_triggerers
(gauge)
Number of active triggerer instances.
gcp.composer.environment.active_webservers
(gauge)
Number of active webserver instances.
gcp.composer.environment.api.request_count
(count)
Number of Composer API requests seen so far.
Shown as request
gcp.composer.environment.api.request_latencies.avg
(gauge)
Average of the distribution of Composer API call latencies.
Shown as millisecond
gcp.composer.environment.api.request_latencies.samplecount
(count)
Sample count for API request latencies.
Shown as millisecond
gcp.composer.environment.api.request_latencies.sumsqdev
(gauge)
Sum of squared deviation for API request latencies.
Shown as second
gcp.composer.environment.celery.execute_command_failure_count
(count)
Cumulative number of non-zero exit codes from Celery task (corresponds to celery.execute_command.failure Airflow metric).
gcp.composer.environment.celery.task_timeout_error_count
(count)
Cumulative number of AirflowTaskTimeout errors raised when publishing Task to Celery Broker (corresponds to celery.task_timeout_error Airflow metric).
gcp.composer.environment.collect_db_dag_duration
(gauge)
Time taken for fetching all serialized DAGs from DB (corresponds to collect_db_dags Airflow metric).
Shown as millisecond
gcp.composer.environment.dag_callback.exception_count
(count)
Cumulative number of exceptions raised from DAG callbacks (corresponds to dag.callback_exceptions Airflow metric).
gcp.composer.environment.dag_file.refresh_error_count
(count)
Cumulative number of failures loading any DAG files (corresponds to dag_file_refresh_error Airflow metric).
gcp.composer.environment.dag_processing.last_duration
(gauge)
Time taken to load the given DAG file (corresponds to dag_processing.last_duration.<dag_file> Airflow metric).
Shown as millisecond
gcp.composer.environment.dag_processing.last_run_elapsed_time
(gauge)
Time since the DAG file was last processed (corresponds to dag_processing.last_run.seconds_ago.<dag_file> Airflow metric).
Shown as second
gcp.composer.environment.dag_processing.manager_stall_count
(count)
Cumulative number of DagFileProcessorManager stalls (corresponds to dag_processing.manager_stalls Airflow metric).
gcp.composer.environment.dag_processing.parse_error_count
(count)
Number of errors raised during parsing DAG files.
Shown as error
gcp.composer.environment.dag_processing.processes
(gauge)
Number of currently running DAG parsing processes.
Shown as process
gcp.composer.environment.dag_processing.processor_timeout_count
(count)
Number of file processors terminated due to processing timeout.
gcp.composer.environment.dag_processing.total_parse_time
(gauge)
Number of seconds taken to scan and import all DAG files once.
Shown as second
gcp.composer.environment.dagbag_size
(gauge)
The current DAG bag size.
gcp.composer.environment.database.airflow.size
(gauge)
Size of the Airflow metadata database.
Shown as byte
gcp.composer.environment.database.auto_failover_request_count
(count)
Cumulative number of instance auto-failover requests.
gcp.composer.environment.database.available_for_failover
(gauge)
True (value > 0) if Cloud SQL instance is enabled with HA and is ready for failover.
gcp.composer.environment.database.cpu.reserved_cores
(gauge)
Number of cores reserved for the database instance.
Shown as core
gcp.composer.environment.database.cpu.usage_time
(count)
CPU usage time of the database instance, in seconds.
Shown as second
gcp.composer.environment.database.cpu.utilization
(gauge)
CPU utilization ratio (from 0.0 to 1.0) of the database instance.
gcp.composer.environment.database.disk.bytes_used
(gauge)
Used disk space on the database instance, in bytes.
Shown as byte
gcp.composer.environment.database.disk.quota
(gauge)
Maximum data disk size of the database instance, in bytes.
Shown as byte
gcp.composer.environment.database.disk.utilization
(gauge)
Disk quota usage ratio (from 0.0 to 1.0) of the database instance.
gcp.composer.environment.database.memory.bytes_used
(gauge)
Memory usage of the database instance in bytes.
Shown as byte
gcp.composer.environment.database.memory.quota
(gauge)
Maximum RAM size of the database instance, in bytes.
Shown as byte
gcp.composer.environment.database.memory.utilization
(gauge)
Memory utilization ratio (from 0.0 to 1.0) of the database instance.
gcp.composer.environment.database.network.connections
(gauge)
Number of concurrent connections to the database instance.
gcp.composer.environment.database.network.max_connections
(gauge)
Maximum permitted number of concurrent connections to the database instance.
gcp.composer.environment.database.network.received_bytes_count
(count)
Number of bytes received by the database instance.
Shown as byte
gcp.composer.environment.database.network.sent_bytes_count
(count)
Number of bytes sent by the database instance.
Shown as byte
gcp.composer.environment.database_health
(gauge)
Health of Composer Airflow database.
gcp.composer.environment.database_retention.execution_durations.avg
(gauge)
The average of the distribution of cumulative durations of database retention job executions.
Shown as second
gcp.composer.environment.database_retention.execution_durations.samplecount
(gauge)
The sample count for distribution of cumulative durations of database retention job executions.
Shown as second
gcp.composer.environment.database_retention.execution_durations.sumsqdev
(gauge)
The sum of squared deviation for distribution of cumulative durations of database retention job executions.
Shown as second
gcp.composer.environment.database_retention.finished_execution_count
(count)
Cumulative number of database retention executions.
gcp.composer.environment.database_retention.retention_gap
(gauge)
Age of the oldest data that still needs trimming.
Shown as hour
gcp.composer.environment.email.sla_notification_failure_count
(count)
Number of failed SLA miss email notification attempts.
gcp.composer.environment.executor.open_slots
(gauge)
Number of open slots on executor.
gcp.composer.environment.executor.queued_tasks
(gauge)
Number of queued tasks on executor.
Shown as task
gcp.composer.environment.executor.running_tasks
(gauge)
Number of running tasks on executor.
Shown as task
gcp.composer.environment.finished_task_instance_count
(count)
Overall number of finished task instances.
Shown as instance
gcp.composer.environment.health.airflow_api_check_count
(count)
Cumulative number of Airflow API checks.
gcp.composer.environment.health.autoscaling_check_count
(count)
Cumulative number of autoscaling component checks.
gcp.composer.environment.health.cmek_encryption_check_count
(count)
Cumulative number of CMEK encryption checks.
gcp.composer.environment.health.container_restart_count
(count)
Cumulative number of container restarts.
gcp.composer.environment.health.dependency_check_count
(count)
Cumulative number of dependency checks.
gcp.composer.environment.health.dependency_permissions_check_count
(count)
Cumulative number of dependency permissions checks.
gcp.composer.environment.health.pod_event_count
(count)
Cumulative number of pod events.
gcp.composer.environment.health.redis_queue_check_count
(count)
Cumulative number of Redis queue checks.
gcp.composer.environment.healthy
(gauge)
Health of Composer environment.
gcp.composer.environment.job.count
(count)
Cumulative number of started jobs, e.g. SchedulerJob, LocalTaskJob (corresponds to <job_name>_start, <job_name>_end Airflow metrics).
gcp.composer.environment.job.heartbeat_failure_count
(count)
Cumulative number of failed heartbeats for a job (corresponds to <job_name>_heartbeat_failure Airflow metric).
gcp.composer.environment.maintenance_operation
(gauge)
Indicates whether a maintenance operation of a given type is in progress.
gcp.composer.environment.num_celery_workers
(gauge)
Number of Celery workers.
Shown as worker
gcp.composer.environment.operator.created_task_instance_count
(count)
Cumulative number of created task instances per operator (corresponds to task_instance_created-<operator_name> Airflow metric).
gcp.composer.environment.operator.finished_task_instance_count
(count)
Cumulative number of finished task instances per operator (corresponds to operator_successes_<operator_name>, operator_failures_<operator_name> Airflow metrics).
gcp.composer.environment.pool.open_slots
(gauge)
Number of open slots in the pool.
gcp.composer.environment.pool.queued_slots
(gauge)
Number of queued slots in the pool (corresponds to pool.queued_slots.<pool_name> Airflow metric).
gcp.composer.environment.pool.running_slots
(gauge)
Number of running slots in the pool.
gcp.composer.environment.pool.starving_tasks
(gauge)
Number of starving tasks in the pool.
gcp.composer.environment.scheduler.critical_section_duration
(gauge)
Time spent in the critical section of the scheduler loop - only a single scheduler can enter this loop at a time (corresponds to scheduler.critical_section_duration Airflow metric).
Shown as millisecond
gcp.composer.environment.scheduler.critical_section_lock_failure_count
(count)
Cumulative number of times a scheduler process tried to get a lock on the critical section - in order to send tasks to the executor - and found it locked by another process (corresponds to scheduler.critical_section_busy Airflow metric).
gcp.composer.environment.scheduler.pod_eviction_count
(count)
The number of Airflow scheduler pod evictions.
gcp.composer.environment.scheduler.task.externally_killed_count
(count)
Cumulative number of tasks killed externally (corresponds to scheduler.tasks.killed_externally Airflow metric).
gcp.composer.environment.scheduler.task.orphan_count
(count)
Cumulative number of cleared/adopted orphaned tasks (corresponds to scheduler.orphaned_tasks.cleared, scheduler.orphaned_tasks.adopted Airflow metrics).
gcp.composer.environment.scheduler.tasks
(gauge)
Number of tasks managed by scheduler (corresponds to scheduler.tasks.running, scheduler.tasks.starving, scheduler.tasks.executable Airflow metrics).
gcp.composer.environment.scheduler_heartbeat_count
(count)
Number of scheduler heartbeats.
gcp.composer.environment.sla_callback_notification_failure_count
(count)
Cumulative number of failed SLA miss callback notification attempts (corresponds to sla_callback_notification_failure Airflow metric).
gcp.composer.environment.smart_sensor.exception_failures
(gauge)
Number of failures caused by exception in the previous smart sensor poking loop.
gcp.composer.environment.smart_sensor.infra_failures
(gauge)
Number of infrastructure failures in the previous smart sensor poking loop.
gcp.composer.environment.smart_sensor.poked_exception
(gauge)
Number of exceptions in the previous smart sensor poking loop.
gcp.composer.environment.smart_sensor.poked_success
(gauge)
Number of newly succeeded tasks poked by the smart sensor in the previous poking loop.
gcp.composer.environment.smart_sensor.poked_tasks
(gauge)
Number of tasks poked by the smart sensor in the previous poking loop.
gcp.composer.environment.snapshot.creation_count
(count)
Number of created scheduled snapshots.
gcp.composer.environment.snapshot.creation_elapsed_time
(gauge)
Time elapsed during the last scheduled snapshot creation.
Shown as second
gcp.composer.environment.snapshot.size
(gauge)
Size of last scheduled snapshot in bytes.
Shown as byte
gcp.composer.environment.task_instance.previously_succeeded_count
(count)
Cumulative number of times a task instance was already in SUCCESS state before execution (corresponds to previously_succeeded Airflow metric).
gcp.composer.environment.task_queue_length
(gauge)
Number of tasks in queue.
Shown as task
gcp.composer.environment.trigger.blocking_count
(count)
Total number of triggers that blocked the main thread of a triggerer.
gcp.composer.environment.trigger.failed_count
(count)
Total number of triggers that failed.
gcp.composer.environment.trigger.succeeded_count
(count)
Total number of triggers that succeeded.
gcp.composer.environment.unfinished_task_instances
(gauge)
Overall number of task instances in a non-finished state.
Shown as instance
gcp.composer.environment.web_server.cpu.reserved_cores
(gauge)
Number of cores reserved for the web server instance.
Shown as core
gcp.composer.environment.web_server.cpu.usage_time
(count)
CPU usage time of the web server instance, in seconds.
Shown as second
gcp.composer.environment.web_server.health
(gauge)
Health of the Airflow web server.
gcp.composer.environment.web_server.memory.bytes_used
(gauge)
Memory usage of the web server instance in bytes.
Shown as byte
gcp.composer.environment.web_server.memory.quota
(gauge)
Maximum RAM size of the web server instance, in bytes.
Shown as byte
gcp.composer.environment.worker.max_workers
(gauge)
Maximum number of Airflow workers.
Shown as worker
gcp.composer.environment.worker.min_workers
(gauge)
Minimum number of Airflow workers.
Shown as worker
gcp.composer.environment.worker.pod_eviction_count
(count)
Number of Airflow worker pod evictions.
Shown as eviction
gcp.composer.environment.worker.scale_factor_target
(gauge)
Scale factor for Airflow workers count.
gcp.composer.environment.zombie_task_killed_count
(count)
Number of zombie tasks killed.
Shown as task
gcp.composer.workflow.dag.run_duration
(gauge)
Time taken for a DAG run to reach terminal state (corresponds to dagrun.duration.success.<dag_id>, dagrun.duration.failed.<dag_id> Airflow metrics).
Shown as millisecond
gcp.composer.workflow.dependency_check_duration
(gauge)
Time taken to check DAG dependencies (corresponds to dagrun.dependency-check.<dag_id> Airflow metric).
Shown as millisecond
gcp.composer.workflow.run_count
(count)
Number of workflow runs completed so far.
gcp.composer.workflow.run_duration
(gauge)
Duration of workflow run completion.
Shown as second
gcp.composer.workflow.schedule_delay
(gauge)
Delay between the scheduled DagRun start date and the actual DagRun start date (corresponds to dagrun.schedule_delay.<dag_id> Airflow metric).
Shown as millisecond
gcp.composer.workflow.task.log_file_size
(gauge)
Size of log file generated by workflow task in bytes.
Shown as byte
gcp.composer.workflow.task.removed_from_dag_count
(count)
Cumulative number of tasks removed for a given DAG, i.e. task no longer exists in DAG (corresponds to task_removed_from_dag.<dag_id> Airflow metric).
gcp.composer.workflow.task.restored_to_dag_count
(count)
Cumulative number of tasks restored for a given DAG, i.e. task instance which was previously in REMOVED state in the DB is added to DAG file (corresponds to task_restored_to_dag.<dag_id> Airflow metric).
gcp.composer.workflow.task.run_count
(count)
Number of workflow tasks completed so far.
Shown as task
gcp.composer.workflow.task.run_duration
(gauge)
Duration of task completion.
Shown as second
gcp.composer.workflow.task.schedule_delay
(gauge)
Time elapsed between the first task start_date and DagRun expected start (corresponds to dagrun.<dag_id>.first_task_scheduling_delay Airflow metric).
Shown as millisecond
gcp.composer.workflow.task_instance.finished_count
(count)
Cumulative number of finished task instances (corresponds to ti.finish.<dag_id>.<task_id>.<state> Airflow metric).
gcp.composer.workflow.task_instance.queued_duration
(gauge)
Time taken in queued state (corresponds to dag.<dag_id>.<task_id>.queued_duration Airflow metric).
Shown as millisecond
gcp.composer.workflow.task_instance.run_duration
(gauge)
Time taken to finish a task (corresponds to dag.<dag_id>.<task_id>.duration Airflow metric).
Shown as millisecond
gcp.composer.workflow.task_instance.started_count
(count)
Cumulative number of tasks started in a given DAG (corresponds to ti.start.<dag_id>.<task_id> Airflow metric).
gcp.composer.workflow.task_runner.terminated_count
(count)
Number of workflow tasks where the task runner got terminated with a return code.
gcp.composer.workload.cpu.reserved_cores
(gauge)
Number of cores reserved for the workload instance.
gcp.composer.workload.cpu.usage_time
(count)
CPU usage time of the workload instance.
Shown as second
gcp.composer.workload.disk.bytes_used
(gauge)
Used disk space in bytes on the workload instance.
Shown as byte
gcp.composer.workload.disk.quota
(gauge)
Maximum data disk size in bytes of the workload instance.
Shown as byte
gcp.composer.workload.log_entry_count
(count)
Cumulative number of log occurrences with a specified severity level.
gcp.composer.workload.memory.bytes_used
(gauge)
Memory usage of the workload instance in bytes.
Shown as byte
gcp.composer.workload.memory.quota
(gauge)
Maximum RAM size in bytes of the workload instance.
Shown as byte
gcp.composer.workload.restart_count
(count)
Cumulative number of workload restarts.
gcp.composer.workload.trigger.num_running
(gauge)
Number of running triggers in a triggerer.
gcp.composer.workload.uptime
(gauge)
Time since the workload was created.
Shown as second
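
Several metrics in this list are exported as distribution components (`.avg`, `.samplecount`, `.sumsqdev`). As a rough sketch, assuming `sumsqdev` is the sum of squared deviations from the mean over the same samples counted by `samplecount`, the standard deviation can be recovered as shown below. The function name and the sample values are hypothetical, and any aggregation applied when the metrics are queried may change how the values should be combined.

```python
import math


def distribution_stddev(sample_count: float, sum_sq_dev: float) -> float:
    """Population standard deviation from a sample count and a sum of squared deviations."""
    if sample_count <= 0:
        # No samples in the interval, so there is no spread to report.
        return 0.0
    return math.sqrt(sum_sq_dev / sample_count)


if __name__ == "__main__":
    # Hypothetical values for one reporting interval.
    print(distribution_stddev(sample_count=4, sum_sq_dev=16.0))
```

The result is in the same unit as the underlying measurements, so it can be read alongside the corresponding `.avg` value.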

Events

The Google Cloud Composer integration does not include any events.

Service Checks

The Google Cloud Composer integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.