Google Cloud Dataflow

개요

Google Cloud Dataflow는 스트림(실시간) 및 배치(기록) 모드에서 동일한 안정성과 표현 능력으로 데이터를 변환 및 보강할 수 있는 완전관리형 서비스입니다.

Datadog Google Cloud 통합을 사용하여 Google Cloud Dataflow에서 메트릭을 수집합니다.

설정

메트릭 수집

설치

아직 설치하지 않았다면 먼저 Google 클라우드 플랫폼 통합을 설정합니다. 그 외 다른 설치가 필요하지 않습니다.

로그 수집

Google Cloud Dataflow 로그는 Google Cloud Logging으로 수집하여 클라우드 Pub/Sub 토픽을 통해 데이터 플로우 작업으로 전송됩니다. 아직 설정하지 않았다면 Datadog 데이터 플로우 템플릿으로 로깅을 설정하세요.

해당 작업이 완료되면 Google Cloud Logging에서 Google Cloud Dataflow 로그를 다음 Pub/Sub 주제로 내보냅니다.

  1. Google Cloud Logging 페이지로 이동해 Google Cloud Dataflow 로그를 필터링하세요.
  2. Create Sink를 클릭하고 그에 따라 싱크 이름을 지정합니다.
  3. “Cloud Pub/Sub"를 대상으로 선택하고 해당 목적으로 생성된 Pub/Sub 주제를 선택합니다. 참고: Pub/Sub 주제는 다른 프로젝트에 있을 수 있습니다.
  4. Create를 클릭하고 확인 메시지가 나타날 때까지 기다립니다.

수집한 데이터

메트릭

gcp.dataflow.job.active_worker_instances
(gauge)
The active number of worker instances.
gcp.dataflow.job.aggregated_worker_utilization
(gauge)
Aggregated worker utilization (for example, worker CPU utilization) across the worker pool.
Shown as percent
gcp.dataflow.job.backlog_bytes
(gauge)
Amount of known, unprocessed input for a stage, in bytes.
Shown as byte
gcp.dataflow.job.backlog_elements
(gauge)
Amount of known, unprocessed input for a stage, in elements.
gcp.dataflow.job.bigquery.write_count
(count)
BigQuery write requests from BigQueryIO.Write in Dataflow jobs.
gcp.dataflow.job.billable_shuffle_data_processed
(gauge)
The billable bytes of shuffle data processed by this Dataflow job.
Shown as byte
gcp.dataflow.job.bundle_user_processing_latencies.avg
(gauge)
The average bundle user processing latencies from a particular stage. Available for jobs running on Streaming Engine.
Shown as millisecond
gcp.dataflow.job.bundle_user_processing_latencies.samplecount
(gauge)
The sample count for bundle user processing latencies from a particular stage. Available for jobs running on Streaming Engine.
Shown as millisecond
gcp.dataflow.job.bundle_user_processing_latencies.sumsqdev
(gauge)
The sum of squared deviation for bundle user processing latencies from a particular stage. Available for jobs running on Streaming Engine.
Shown as millisecond
gcp.dataflow.job.current_num_vcpus
(gauge)
The number of vCPUs currently being used by this Dataflow job.
Shown as cpu
gcp.dataflow.job.current_shuffle_slots
(gauge)
The current shuffle slots used by this Dataflow job.
gcp.dataflow.job.data_watermark_age
(gauge)
The age (time since event timestamp) of the most recent item of data that has been fully processed by the pipeline.
Shown as second
gcp.dataflow.job.disk_space_capacity
(gauge)
The amount of persistent disk currently being allocated to all workers associated with this Dataflow job.
Shown as byte
gcp.dataflow.job.dofn_latency_average
(gauge)
The average processing time for a single message in a given DoFn (over the past 3 min window). Note that this includes time spent in GetData calls.
Shown as millisecond
gcp.dataflow.job.dofn_latency_max
(gauge)
The maximum processing time for a single message in a given DoFn (over the past 3 min window). Note that this includes time spent in GetData calls.
Shown as millisecond
gcp.dataflow.job.dofn_latency_min
(gauge)
The minimum processing time for a single message in a given DoFn (over the past 3 min window).
Shown as millisecond
gcp.dataflow.job.dofn_latency_num_messages
(gauge)
The number of messages processed by a given DoFn (over the past 3 min window).
gcp.dataflow.job.dofn_latency_total
(gauge)
The total processing time for all messages in a given DoFn (over the past 3 min window). Note that this includes time spent in GetData calls.
Shown as millisecond
gcp.dataflow.job.duplicates_filtered_out_count
(count)
The number of messages being processed by a particular stage that have been filtered out as duplicates. Available for jobs running on Streaming Engine.
gcp.dataflow.job.elapsed_time
(gauge)
Duration that the current run of this pipeline has been in the Running state so far, in seconds. When a run completes, this stays at the duration of that run until the next run starts.
Shown as second
gcp.dataflow.job.element_count
(count)
Number of elements added to the PCollection so far.
Shown as item
gcp.dataflow.job.elements_produced_count
(count)
The number of elements produced by each PTransform.
gcp.dataflow.job.estimated_backlog_processing_time
(gauge)
Estimated time (in seconds) to consume current backlog if no new data comes in and throughput stays the same. Only available for Streaming Engine jobs.
Shown as second
gcp.dataflow.job.estimated_byte_count
(count)
An estimated number of bytes added to the PCollection so far.
Shown as byte
gcp.dataflow.job.estimated_bytes_active
(gauge)
Estimated number of bytes active in this stage of the job.
Shown as byte
gcp.dataflow.job.estimated_bytes_consumed_count
(count)
Estimated number of bytes consumed by the stage of this job.
Shown as byte
gcp.dataflow.job.estimated_bytes_produced_count
(count)
The estimated total byte size of elements produced by each PTransform.
gcp.dataflow.job.estimated_timer_backlog_processing_time
(gauge)
Estimated time (in seconds) for timers to complete. Only available for Streaming Engine jobs.
Shown as second
gcp.dataflow.job.gpu_memory_utilization
(gauge)
Percent of time over the past sample period during which global (device) memory was being read or written.
Shown as percent
gcp.dataflow.job.gpu_utilization
(gauge)
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
Shown as percent
gcp.dataflow.job.horizontal_worker_scaling
(gauge)
A boolean value that indicates what kind of horizontal scaling direction which the autoscaler recommended and rationale behind it. A true metric output means a scaling decision is made and a false metric output means the corresponding scaling is not taking effect.
gcp.dataflow.job.is_failed
(gauge)
Has this job failed.
gcp.dataflow.job.max_worker_instances_limit
(gauge)
The maximum number of workers autoscaling is allowed to request.
gcp.dataflow.job.memory_capacity
(gauge)
The amount of memory currently being allocated to all workers associated with this Dataflow job.
Shown as byte
gcp.dataflow.job.min_worker_instances_limit
(gauge)
The minimum number of workers autoscaling is allowed to request.
gcp.dataflow.job.oldest_active_message_age
(gauge)
How long the oldest active message in a DoFn has been processing for.
Shown as millisecond
gcp.dataflow.job.per_stage_data_watermark_age
(gauge)
The age (time since event timestamp) up to which all data has been processed by this stage of the pipeline.
Shown as second
gcp.dataflow.job.per_stage_system_lag
(gauge)
The current maximum duration that an item of data has been processing or awaiting processing in seconds, per pipeline stage.
Shown as second
gcp.dataflow.job.processing_parallelism_keys
(gauge)
Approximate number of keys in use for data processing for each stage. Processing for any given key is serialized, so the total number of keys for a stage represents the maximum available parallelism at that stage. Available for jobs running on Streaming Engine.
gcp.dataflow.job.pubsub.late_messages_count
(count)
The number of messages from Pub/Sub with timestamp older than the estimated watermark.
gcp.dataflow.job.pubsub.published_messages_count
(count)
The number of Pub/Sub messages published broken down by topic and status.
gcp.dataflow.job.pubsub.pulled_message_ages.avg
(gauge)
The average the distribution of pulled but unacked Pub/Sub message ages.
Shown as millisecond
gcp.dataflow.job.pubsub.pulled_message_ages.samplecount
(gauge)
The sample count for the distribution of pulled but unacked Pub/Sub message ages.
Shown as millisecond
gcp.dataflow.job.pubsub.pulled_message_ages.sumsqdev
(gauge)
The sum of squared deviation for the distribution of pulled but unacked Pub/Sub message ages.
Shown as millisecond
gcp.dataflow.job.pubsub.read_count
(count)
Pub/Sub Pull Requests. For Streaming Engine, this metric is deprecated. See the Using the Dataflow monitoring interface page for upcoming changes.
gcp.dataflow.job.pubsub.read_latencies.avg
(count)
The average Pub/Sub Pull request latencies from PubsubIO.Read in Dataflow jobs. For Streaming Engine, this metric is deprecated. See the Using the Dataflow monitoring interface page for upcoming changes.
Shown as millisecond
gcp.dataflow.job.pubsub.read_latencies.samplecount
(count)
The sample count for Pub/Sub Pull request latencies from PubsubIO.Read in Dataflow jobs. For Streaming Engine, this metric is deprecated. See the Using the Dataflow monitoring interface page for upcoming changes.
Shown as millisecond
gcp.dataflow.job.pubsub.read_latencies.sumsqdev
(count)
The sum of squared deviation for Pub/Sub Pull request latencies from PubsubIO.Read in Dataflow jobs. For Streaming Engine, this metric is deprecated. See the Using the Dataflow monitoring interface page for upcoming changes.
Shown as millisecond
gcp.dataflow.job.pubsub.streaming_pull_connection_status
(gauge)
Percentage of all Streaming Pull connections that are either active (OK status) or terminated because of an error (non-OK status). When a connection is terminated, Dataflow will wait some time before attempting to re-connect. For Streaming Engine only.
Shown as percent
gcp.dataflow.job.pubsub.write_count
(count)
Pub/Sub Publish requests from PubsubIO.Write in Dataflow jobs.
gcp.dataflow.job.pubsub.write_latencies.avg
(count)
The average Pub/Sub Publish request latencies from PubsubIO.Write in Dataflow jobs.
Shown as millisecond
gcp.dataflow.job.pubsub.write_latencies.samplecount
(count)
The sample count for Pub/Sub Publish request latencies from PubsubIO.Write in Dataflow jobs.
Shown as millisecond
gcp.dataflow.job.pubsub.write_latencies.sumsqdev
(count)
The sum of squared deviation for Pub/Sub Publish request latencies from PubsubIO.Write in Dataflow jobs.
Shown as millisecond
gcp.dataflow.job.status
(gauge)
Current state of this pipeline (for example, RUNNING, DONE, CANCELLED, or FAILED). Not reported while the pipeline is not running.
gcp.dataflow.job.streaming_engine.key_processing_availability
(gauge)
Percentage of streaming processing keys that are assigned to workers and available to perform work. Work for unavailable keys will be deferred until keys are available.
gcp.dataflow.job.streaming_engine.persistent_state.read_bytes_count
(count)
Storage bytes read by a particular stage. Available for jobs running on Streaming Engine.
gcp.dataflow.job.streaming_engine.persistent_state.stored_bytes
(gauge)
Current bytes stored in persistent state for the job.
Shown as byte
gcp.dataflow.job.streaming_engine.persistent_state.write_bytes_count
(count)
Storage bytes written by a particular stage. Available for jobs running on Streaming Engine.
gcp.dataflow.job.streaming_engine.persistent_state.write_latencies.avg
(count)
The average storage write latencies from a particular stage. Available for jobs running on Streaming Engine.
Shown as millisecond
gcp.dataflow.job.streaming_engine.persistent_state.write_latencies.samplecount
(count)
The sample count for storage write latencies from a particular stage. Available for jobs running on Streaming Engine.
Shown as millisecond
gcp.dataflow.job.streaming_engine.persistent_state.write_latencies.sumsqdev
(count)
The sum of squared deviation for storage write latencies from a particular stage. Available for jobs running on Streaming Engine.
Shown as millisecond
gcp.dataflow.job.streaming_engine.stage_end_to_end_latencies.avg
(gauge)
The average distribution of time spent by streaming engine in each stage of the pipeline. This time includes shuffling messages, queueing them for processing, processing, queueing for persistent state write, and the write itself.
Shown as millisecond
gcp.dataflow.job.streaming_engine.stage_end_to_end_latencies.samplecount
(gauge)
The sample count for distribution of time spent by streaming engine in each stage of the pipeline. This time includes shuffling messages, queueing them for processing, processing, queueing for persistent state write, and the write itself.
Shown as millisecond
gcp.dataflow.job.streaming_engine.stage_end_to_end_latencies.sumsqdev
(gauge)
The sum of squared deviation for distribution of time spent by streaming engine in each stage of the pipeline. This time includes shuffling messages, queueing them for processing, processing, queueing for persistent state write, and the write itself.
Shown as millisecond
gcp.dataflow.job.system_lag
(gauge)
The current maximum duration that an item of data has been awaiting processing, in seconds.
Shown as second
gcp.dataflow.job.target_worker_instances
(gauge)
The desired number of worker instances.
gcp.dataflow.job.thread_time
(count)
Estimated time in milliseconds spent running in the function of the ptransform totaled across threads on all workers of the job.
Shown as millisecond
gcp.dataflow.job.timers_pending_count
(count)
The number of timers pending in a particular stage. Available for jobs running on Streaming Engine.
gcp.dataflow.job.timers_processed_count
(count)
The number of timers completed by a particular stage. Available for jobs running on Streaming Engine.
gcp.dataflow.job.total_dcu_usage
(count)
Cumulative amount of DCUs (Data Compute Unit) used by the Dataflow job since it was launched.
gcp.dataflow.job.total_memory_usage_time
(gauge)
The total GB seconds of memory allocated to this Dataflow job.
Shown as gibibyte
gcp.dataflow.job.total_pd_usage_time
(gauge)
The total GB seconds for all persistent disk used by all workers associated with this Dataflow job.
Shown as gibibyte
gcp.dataflow.job.total_secu_usage
(gauge)
The total amount of SECUs (Streaming Engine Compute Unit) used by the Dataflow job since it was launched.
gcp.dataflow.job.total_shuffle_data_processed
(gauge)
The total bytes of shuffle data processed by this Dataflow job.
Shown as byte
gcp.dataflow.job.total_streaming_data_processed
(gauge)
The total bytes of streaming data processed by this Dataflow job.
Shown as byte
gcp.dataflow.job.total_vcpu_time
(gauge)
The total vCPU seconds used by this Dataflow job.
gcp.dataflow.job.user_counter
(gauge)
A user-defined counter metric.
gcp.dataflow.job.worker_utilization_hint
(gauge)
Worker utilization hint for autoscaling. This hint value is configured by the customers and defines a target worker CPU utilization range thus influencing scaling aggressiveness.
Shown as percent
gcp.dataflow.job.worker_utilization_hint_is_actively_used
(gauge)
Reports whether or not the worker utilization hint is actively used by the horizontal autoscaling policy.
gcp.dataflow.quota.region_endpoint_shuffle_slot.exceeded
(count)
Number of attempts to exceed the limit on quota metric dataflow.googleapis.com/region_endpoint_shuffle_slot.
gcp.dataflow.quota.region_endpoint_shuffle_slot.limit
(gauge)
Current limit on quota metric dataflow.googleapis.com/region_endpoint_shuffle_slot.
gcp.dataflow.quota.region_endpoint_shuffle_slot.usage
(gauge)
Current usage on quota metric dataflow.googleapis.com/region_endpoint_shuffle_slot.
gcp.dataflow.worker.memory.bytes_used
(gauge)
The memory usage in bytes by a particular container instance on a Dataflow worker.
Shown as byte
gcp.dataflow.worker.memory.container_limit
(gauge)
Maximum RAM size in bytes available to a particular container instance on a Dataflow worker.
Shown as byte
gcp.dataflow.worker.memory.total_limit
(gauge)
RAM size in bytes available to a Dataflow worker.
Shown as byte
Google Cloud Dataflow를 사용하여 Apache Beam 파이프라인 메트릭을 모니터링하는 경우 게이지 정적 메서드에서 생성한 메트릭은 수집하지 않습니다. 해당 메트릭을 모니터링해야 하는 경우 마이크로미터(Micrometer)를 사용할 수 있습니다.

이벤트

Google Cloud Dataflow 통합은 이벤트를 포함하지 않습니다.

서비스 점검

Google Cloud Dataflow 통합은 서비스 점검을 포함하지 않습니다.

트러블슈팅

도움이 필요하신가요? Datadog 지원팀에 문의하세요.

참고 자료

추가 유용한 문서, 링크 및 기사: