Google Cloud Composer

Overview

Google Cloud Composer is a fully managed workflow orchestration service that helps you author, schedule, and monitor pipelines that span clouds and on-premises data centers.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud Composer.

Setup

Installation

If you haven't already, set up the Google Cloud Platform integration first. No other installation steps are required.

Log collection

Google Cloud Composer logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud Composer logs from Google Cloud Logging to the Pub/Sub topic:

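  1. Go to the Google Cloud Logging page and filter the Google Cloud Composer logs.
  2. Click Create Export and name the sink.
  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
  4. Click Create and wait for the confirmation message to appear.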
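
The console steps above can also be approximated from the command line. The following is a minimal sketch, not part of the original instructions: it only builds a Cloud Logging filter and prints a `gcloud logging sinks create` command for review. It assumes that Composer logs carry the monitored resource type `cloud_composer_environment` with an `environment_name` label; the sink, project, and topic names are hypothetical placeholders.

```python
import shlex


def composer_log_filter(environment_name=None):
    """Build a Cloud Logging filter for Cloud Composer logs.

    Assumption: Composer logs use the monitored resource type
    "cloud_composer_environment" with an "environment_name" label.
    """
    clauses = ['resource.type="cloud_composer_environment"']
    if environment_name:
        clauses.append(f'resource.labels.environment_name="{environment_name}"')
    return " AND ".join(clauses)


def sink_create_args(sink_name, project_id, topic_id, log_filter):
    """Return the argument list for `gcloud logging sinks create`."""
    # Pub/Sub sink destinations are written as
    # pubsub.googleapis.com/projects/<project>/topics/<topic>.
    destination = f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"
    return [
        "gcloud", "logging", "sinks", "create",
        sink_name, destination, f"--log-filter={log_filter}",
    ]


def sink_create_command(sink_name, project_id, topic_id, log_filter):
    """Return a shell-quoted command string that is safe to paste into a terminal."""
    args = sink_create_args(sink_name, project_id, topic_id, log_filter)
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    log_filter = composer_log_filter("my-composer-env")
    print(sink_create_command(
        "composer-to-datadog", "my-project", "datadog-logs", log_filter))
```

Printing the command instead of running it keeps the sketch free of side effects. After a sink is created, its writer identity still needs permission to publish to the topic, which the sketch does not cover.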

Data Collected

Metrics

gcp.composer.environment.active_schedulers
(gauge)
Number of active scheduler instances.
gcp.composer.environment.active_triggerers
(gauge)
Number of active triggerer instances.
gcp.composer.environment.active_webservers
(gauge)
Number of active webserver instances.
gcp.composer.environment.api.request_count
(count)
Number of Composer API requests seen so far.
Shown as request
gcp.composer.environment.api.request_latencies.avg
(gauge)
Average of the distribution of Composer API call latencies.
Shown as millisecond
gcp.composer.environment.api.request_latencies.samplecount
(count)
Sample count for API request latencies.
Shown as millisecond
gcp.composer.environment.api.request_latencies.sumsqdev
(gauge)
Sum of squared deviation for API request latencies.
Shown as second
gcp.composer.environment.celery.execute_command_failure_count
(count)
Cumulative number of non-zero exit codes from Celery task (corresponds to celery.execute_command.failure Airflow metric).
gcp.composer.environment.celery.task_timeout_error_count
(count)
Cumulative number of AirflowTaskTimeout errors raised when publishing Task to Celery Broker (corresponds to celery.task_timeout_error Airflow metric).
gcp.composer.environment.collect_db_dag_duration
(gauge)
Time taken for fetching all serialized DAGs from DB (corresponds to collect_db_dags Airflow metric).
Shown as millisecond
gcp.composer.environment.dag_callback.exception_count
(count)
Cumulative number of exceptions raised from DAG callbacks (corresponds to dag.callback_exceptions Airflow metric).
gcp.composer.environment.dag_file.refresh_error_count
(count)
Cumulative number of failures loading any DAG files (corresponds to dag_file_refresh_error Airflow metric).
gcp.composer.environment.dag_processing.last_duration
(gauge)
Time taken to load the given DAG file (corresponds to dag_processing.last_duration.<dag_file> Airflow metric).
Shown as millisecond
gcp.composer.environment.dag_processing.last_run_elapsed_time
(gauge)
Time since the DAG file was last processed (corresponds to dag_processing.last_run.seconds_ago.<dag_file> Airflow metric).
Shown as second
gcp.composer.environment.dag_processing.manager_stall_count
(count)
Cumulative number of DagFileProcessorManager stalls (corresponds to dag_processing.manager_stalls Airflow metric).
gcp.composer.environment.dag_processing.parse_error_count
(count)
Number of errors raised during parsing DAG files.
Shown as error
gcp.composer.environment.dag_processing.processes
(gauge)
Number of currently running DAG parsing processes.
Shown as process
gcp.composer.environment.dag_processing.processor_timeout_count
(count)
Number of file processors terminated due to processing timeout.
gcp.composer.environment.dag_processing.total_parse_time
(gauge)
Number of seconds taken to scan and import all DAG files once.
Shown as second
gcp.composer.environment.dagbag_size
(gauge)
The current DAG bag size.
gcp.composer.environment.database.airflow.size
(gauge)
Size of the Airflow metadata database.
Shown as byte
gcp.composer.environment.database.auto_failover_request_count
(count)
Cumulative number of instance auto-failover requests.
gcp.composer.environment.database.available_for_failover
(gauge)
True (value > 0) if Cloud SQL instance is enabled with HA and is ready for failover.
gcp.composer.environment.database.cpu.reserved_cores
(gauge)
Number of cores reserved for the database instance.
Shown as core
gcp.composer.environment.database.cpu.usage_time
(count)
CPU usage time of the database instance, in seconds.
Shown as second
gcp.composer.environment.database.cpu.utilization
(gauge)
CPU utilization ratio (from 0.0 to 1.0) of the database instance.
gcp.composer.environment.database.disk.bytes_used
(gauge)
Used disk space on the database instance, in bytes.
Shown as byte
gcp.composer.environment.database.disk.quota
(gauge)
Maximum data disk size of the database instance, in bytes.
Shown as byte
gcp.composer.environment.database.disk.utilization
(gauge)
Disk quota usage ratio (from 0.0 to 1.0) of the database instance.
gcp.composer.environment.database.memory.bytes_used
(gauge)
Memory usage of the database instance in bytes.
Shown as byte
gcp.composer.environment.database.memory.quota
(gauge)
Maximum RAM size of the database instance, in bytes.
Shown as byte
gcp.composer.environment.database.memory.utilization
(gauge)
Memory utilization ratio (from 0.0 to 1.0) of the database instance.
gcp.composer.environment.database.network.connections
(gauge)
Number of concurrent connections to the database instance.
gcp.composer.environment.database.network.max_connections
(gauge)
Maximum permitted number of concurrent connections to the database instance.
gcp.composer.environment.database.network.received_bytes_count
(count)
Number of bytes received by the database instance.
Shown as byte
gcp.composer.environment.database.network.sent_bytes_count
(count)
Number of bytes sent by the database instance.
Shown as byte
gcp.composer.environment.database_health
(gauge)
Health of Composer Airflow database.
gcp.composer.environment.database_retention.execution_durations.avg
(gauge)
The average of the distribution of cumulative durations of database retention job executions.
Shown as second
gcp.composer.environment.database_retention.execution_durations.samplecount
(gauge)
The sample count for distribution of cumulative durations of database retention job executions.
Shown as second
gcp.composer.environment.database_retention.execution_durations.sumsqdev
(gauge)
The sum of squared deviation for distribution of cumulative durations of database retention job executions.
Shown as second
gcp.composer.environment.database_retention.finished_execution_count
(count)
Cumulative number of database retention executions.
gcp.composer.environment.database_retention.retention_gap
(gauge)
Age of the oldest data that still needs trimming.
Shown as hour
gcp.composer.environment.email.sla_notification_failure_count
(count)
Number of failed SLA miss email notification attempts.
gcp.composer.environment.executor.open_slots
(gauge)
Number of open slots on executor.
gcp.composer.environment.executor.queued_tasks
(gauge)
Number of queued tasks on executor.
Shown as task
gcp.composer.environment.executor.running_tasks
(gauge)
Number of running tasks on executor.
Shown as task
gcp.composer.environment.finished_task_instance_count
(count)
Overall number of finished task instances.
Shown as instance
gcp.composer.environment.health.airflow_api_check_count
(count)
Cumulative number of Airflow API checks.
gcp.composer.environment.health.autoscaling_check_count
(count)
Cumulative number of autoscaling components checks.
gcp.composer.environment.health.cmek_encryption_check_count
(count)
Cumulative number of CMEK encryption checks.
gcp.composer.environment.health.container_restart_count
(count)
Cumulative number of container restarts.
gcp.composer.environment.health.dependency_check_count
(count)
Cumulative number of dependency checks.
gcp.composer.environment.health.dependency_permissions_check_count
(count)
Cumulative number of dependency permissions checks.
gcp.composer.environment.health.pod_event_count
(count)
Cumulative number of pod events.
gcp.composer.environment.health.redis_queue_check_count
(count)
Cumulative number of redis queue checks.
gcp.composer.environment.healthy
(gauge)
Health of Composer environment.
gcp.composer.environment.job.count
(count)
Cumulative number of started jobs, e.g. SchedulerJob, LocalTaskJob (corresponds to <job_name>_start, <job_name>_end Airflow metrics).
gcp.composer.environment.job.heartbeat_failure_count
(count)
Cumulative number of failed heartbeats for a job (corresponds to <job_name>_heartbeat_failure Airflow metric).
gcp.composer.environment.maintenance_operation
(gauge)
Indicates whether there is a maintenance operation of a given type.
gcp.composer.environment.num_celery_workers
(gauge)
Number of Celery workers.
Shown as worker
gcp.composer.environment.operator.created_task_instance_count
(count)
Cumulative number of created task instances per operator (corresponds to task_instance_created-<operator_name> Airflow metric).
gcp.composer.environment.operator.finished_task_instance_count
(count)
Cumulative number of finished task instances per operator (corresponds to operator_successes_<operator_name>, operator_failures_<operator_name> Airflow metrics).
gcp.composer.environment.pool.open_slots
(gauge)
Number of open slots in the pool.
gcp.composer.environment.pool.queued_slots
(gauge)
Number of queued slots in the pool (corresponds to pool.queued_slots.<pool_name> Airflow metric).
gcp.composer.environment.pool.running_slots
(gauge)
Number of running slots in the pool.
gcp.composer.environment.pool.starving_tasks
(gauge)
Number of starving tasks in the pool.
gcp.composer.environment.scheduler.critical_section_duration
(gauge)
Time spent in the critical section of the scheduler loop - only a single scheduler can enter this loop at a time (corresponds to scheduler.critical_section_duration Airflow metric).
Shown as millisecond
gcp.composer.environment.scheduler.critical_section_lock_failure_count
(count)
Cumulative number of times a scheduler process tried to get a lock on the critical section - in order to send tasks to the executor - and found it locked by another process (corresponds to scheduler.critical_section_busy Airflow metric).
gcp.composer.environment.scheduler.pod_eviction_count
(count)
The number of Airflow scheduler pod evictions.
gcp.composer.environment.scheduler.task.externally_killed_count
(count)
Cumulative number of tasks killed externally (corresponds to scheduler.tasks.killed_externally Airflow metric).
gcp.composer.environment.scheduler.task.orphan_count
(count)
Cumulative number of cleared/adopted orphaned tasks (corresponds to scheduler.orphaned_tasks.cleared, scheduler.orphaned_tasks.adopted Airflow metrics).
gcp.composer.environment.scheduler.tasks
(gauge)
Number of tasks managed by scheduler (corresponds to scheduler.tasks.running, scheduler.tasks.starving, scheduler.tasks.executable Airflow metrics).
gcp.composer.environment.scheduler_heartbeat_count
(count)
Scheduler heartbeats.
gcp.composer.environment.sla_callback_notification_failure_count
(count)
Cumulative number of failed SLA miss callback notification attempts (corresponds to sla_callback_notification_failure Airflow metric).
gcp.composer.environment.smart_sensor.exception_failures
(gauge)
Number of failures caused by exception in the previous smart sensor poking loop.
gcp.composer.environment.smart_sensor.infra_failures
(gauge)
Number of infrastructure failures in the previous smart sensor poking loop.
gcp.composer.environment.smart_sensor.poked_exception
(gauge)
Number of exceptions in the previous smart sensor poking loop.
gcp.composer.environment.smart_sensor.poked_success
(gauge)
Number of newly succeeded tasks poked by the smart sensor in the previous poking loop.
gcp.composer.environment.smart_sensor.poked_tasks
(gauge)
Number of tasks poked by the smart sensor in the previous poking loop.
gcp.composer.environment.snapshot.creation_count
(count)
Number of created scheduled snapshots.
gcp.composer.environment.snapshot.creation_elapsed_time
(gauge)
Time elapsed during the last scheduled snapshot creation.
Shown as second
gcp.composer.environment.snapshot.size
(gauge)
Size of last scheduled snapshot in bytes.
Shown as byte
gcp.composer.environment.task_instance.previously_succeeded_count
(count)
Cumulative number of times a task instance was already in SUCCESS state before execution (corresponds to previously_succeeded Airflow metric).
gcp.composer.environment.task_queue_length
(gauge)
Number of tasks in queue.
Shown as task
gcp.composer.environment.trigger.blocking_count
(count)
Total number of triggers that blocked the main thread of a triggerer.
gcp.composer.environment.trigger.failed_count
(count)
Total number of triggers that failed.
gcp.composer.environment.trigger.succeeded_count
(count)
Total number of triggers that succeeded.
gcp.composer.environment.unfinished_task_instances
(gauge)
Overall number of task instances in a non-finished state.
Shown as instance
gcp.composer.environment.web_server.cpu.reserved_cores
(gauge)
Number of cores reserved for the web server instance.
Shown as core
gcp.composer.environment.web_server.cpu.usage_time
(count)
CPU usage time of the web server instance, in seconds.
Shown as second
gcp.composer.environment.web_server.health
(gauge)
Health of the Airflow web server.
gcp.composer.environment.web_server.memory.bytes_used
(gauge)
Memory usage of the web server instance in bytes.
Shown as byte
gcp.composer.environment.web_server.memory.quota
(gauge)
Maximum RAM size of the web server instance, in bytes.
Shown as byte
gcp.composer.environment.worker.max_workers
(gauge)
Maximum number of Airflow workers.
Shown as worker
gcp.composer.environment.worker.min_workers
(gauge)
Minimum number of Airflow workers.
Shown as worker
gcp.composer.environment.worker.pod_eviction_count
(count)
Number of Airflow worker pod evictions.
Shown as eviction
gcp.composer.environment.worker.scale_factor_target
(gauge)
Scale factor for Airflow workers count.
gcp.composer.environment.zombie_task_killed_count
(count)
Number of zombie tasks killed.
Shown as task
gcp.composer.workflow.dag.run_duration
(gauge)
Time taken for a DAG run to reach terminal state (corresponds to dagrun.duration.success.<dag_id>, dagrun.duration.failed.<dag_id> Airflow metrics).
Shown as millisecond
gcp.composer.workflow.dependency_check_duration
(gauge)
Time taken to check DAG dependencies (corresponds to dagrun.dependency-check.<dag_id> Airflow metric).
Shown as millisecond
gcp.composer.workflow.run_count
(count)
Number of workflow runs completed so far.
gcp.composer.workflow.run_duration
(gauge)
Duration of workflow run completion.
Shown as second
gcp.composer.workflow.schedule_delay
(gauge)
Delay between the scheduled DagRun start date and the actual DagRun start date (corresponds to dagrun.schedule_delay.<dag_id> Airflow metric).
Shown as millisecond
gcp.composer.workflow.task.log_file_size
(gauge)
Size of log file generated by workflow task in bytes.
Shown as byte
gcp.composer.workflow.task.removed_from_dag_count
(count)
Cumulative number of tasks removed for a given DAG, i.e. task no longer exists in DAG (corresponds to task_removed_from_dag.<dag_id> Airflow metric).
gcp.composer.workflow.task.restored_to_dag_count
(count)
Cumulative number of tasks restored for a given DAG, i.e. task instance which was previously in REMOVED state in the DB is added to DAG file (corresponds to task_restored_to_dag.<dag_id> Airflow metric).
gcp.composer.workflow.task.run_count
(count)
Number of workflow tasks completed so far.
Shown as task
gcp.composer.workflow.task.run_duration
(gauge)
Duration of task completion.
Shown as second
gcp.composer.workflow.task.schedule_delay
(gauge)
Time elapsed between the first task start_date and DagRun expected start (corresponds to dagrun.<dag_id>.first_task_scheduling_delay Airflow metric).
Shown as millisecond
gcp.composer.workflow.task_instance.finished_count
(count)
Cumulative number of finished task instances (corresponds to ti.finish.<dag_id>.<task_id>.<state> Airflow metric).
gcp.composer.workflow.task_instance.queued_duration
(gauge)
Time taken in queued state (corresponds to dag.<dag_id>.<task_id>.queued_duration Airflow metric).
Shown as millisecond
gcp.composer.workflow.task_instance.run_duration
(gauge)
Time taken to finish a task (corresponds to dag.<dag_id>.<task_id>.duration Airflow metric).
Shown as millisecond
gcp.composer.workflow.task_instance.started_count
(count)
Cumulative number of tasks started in a given DAG (corresponds to ti.start.<dag_id>.<task_id> Airflow metric).
gcp.composer.workflow.task_runner.terminated_count
(count)
Number of workflow tasks where the task runner got terminated with a return code.
gcp.composer.workload.cpu.reserved_cores
(gauge)
Number of cores reserved for the workload instance.
gcp.composer.workload.cpu.usage_time
(count)
CPU usage time of the workload instance.
Shown as second
gcp.composer.workload.disk.bytes_used
(gauge)
Used disk space in bytes on the workload instance.
Shown as byte
gcp.composer.workload.disk.quota
(gauge)
Maximum data disk size in bytes of the workload instance.
Shown as byte
gcp.composer.workload.log_entry_count
(count)
Cumulative number of log occurrences with a specified severity level.
gcp.composer.workload.memory.bytes_used
(gauge)
Memory usage of the workload instance in bytes.
Shown as byte
gcp.composer.workload.memory.quota
(gauge)
Maximum RAM size in bytes of the workload instance.
Shown as byte
gcp.composer.workload.restart_count
(count)
Cumulative number of workload restarts.
gcp.composer.workload.trigger.num_running
(gauge)
Number of running triggers in a triggerer.
gcp.composer.workload.uptime
(gauge)
Time since the workload was created.
Shown as second
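
Several metrics in this list are components of a distribution (`.avg`, `.samplecount`, `.sumsqdev`). As a rough illustration of how they relate, the sketch below derives a standard deviation from them. It assumes that `sumsqdev` is the sum of squared deviations from the mean, as in Google Cloud Monitoring distribution values, and that both inputs cover the same time interval. The function name is hypothetical and the result is the population standard deviation.

```python
import math


def distribution_stddev(sumsqdev, samplecount):
    """Population standard deviation from distribution components.

    Assumption: sumsqdev is the sum of (x_i - mean)^2 over the same
    samples that samplecount counts.
    Returns None when there are no samples.
    """
    if samplecount <= 0:
        return None
    variance = sumsqdev / samplecount
    return math.sqrt(variance)


if __name__ == "__main__":
    # Illustrative values only, not real measurements.
    print(distribution_stddev(sumsqdev=200.0, samplecount=8))
```

The result is in the same unit as the underlying samples, so check the unit of the matching `.avg` metric before comparing the two.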

Events

The Google Cloud Composer integration does not include any events.

Service Checks

The Google Cloud Composer integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.