Overview

Data Jobs Monitoring helps you observe, troubleshoot, and cost-optimize Spark jobs on your Dataproc clusters.

Google Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters simply and cost-efficiently.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud Dataproc.

Setup

Installation

If you haven't already, set up the Google Cloud Platform integration first. There are no other installation steps.

Log collection

Google Cloud Dataproc logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud Dataproc logs from Google Cloud Logging to the Pub/Sub topic:

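  1. Go to the Google Cloud Logging page and filter the Google Cloud Dataproc logs.
  2. Click Create Export and name the sink.
  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
  4. Click Create and wait for the confirmation message to appear.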
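
The filter and destination used in these steps can be sketched as plain strings. This is a minimal illustration, not part of the integration itself: the helper names, the project ID `my-project`, the topic ID `export-logs-to-datadog`, and the cluster name are hypothetical, and the sketch assumes your Dataproc cluster logs carry the `cloud_dataproc_cluster` resource type with a `cluster_name` label.

```python
def dataproc_log_filter(cluster_name=None):
    """Build a Cloud Logging filter that matches Dataproc cluster logs.

    Assumption: the logs use the `cloud_dataproc_cluster` resource type.
    """
    clauses = ['resource.type="cloud_dataproc_cluster"']
    if cluster_name is not None:
        # Optionally narrow the filter to a single cluster.
        clauses.append(f'resource.labels.cluster_name="{cluster_name}"')
    return " AND ".join(clauses)


def pubsub_sink_destination(project_id, topic_id):
    """Format a Pub/Sub topic as a log sink destination.

    The topic's project may differ from the project that owns the sink.
    """
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


# Hypothetical values, for illustration only.
log_filter = dataproc_log_filter("my-cluster")
destination = pubsub_sink_destination("my-project", "export-logs-to-datadog")
print(log_filter)
print(destination)
```

The two strings correspond to what you enter in steps 1 and 3: the first narrows the log view to Dataproc entries, and the second names the topic the sink writes to.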

Data Collected

Metrics

gcp.dataproc.batch.spark.executors
(gauge)
Indicates the number of Batch Spark executors.
Shown as worker
gcp.dataproc.cluster.hdfs.datanodes
(gauge)
Indicates the number of HDFS DataNodes that are running inside a cluster.
Shown as node
gcp.dataproc.cluster.hdfs.storage_capacity
(gauge)
Indicates the capacity of the HDFS system running on a cluster, in GB.
Shown as gibibyte
gcp.dataproc.cluster.hdfs.storage_utilization
(gauge)
The percentage of HDFS storage currently used.
Shown as percent
gcp.dataproc.cluster.hdfs.unhealthy_blocks
(gauge)
Indicates the number of unhealthy blocks inside the cluster.
Shown as block
gcp.dataproc.cluster.job.completion_time.avg
(gauge)
The time jobs took to complete from the time the user submits a job to the time Dataproc reports it is completed.
Shown as millisecond
gcp.dataproc.cluster.job.completion_time.samplecount
(count)
Sample count for cluster job completion time.
Shown as millisecond
gcp.dataproc.cluster.job.completion_time.sumsqdev
(gauge)
Sum of squared deviation for cluster job completion time.
Shown as second
gcp.dataproc.cluster.job.duration.avg
(gauge)
The time jobs have spent in a given state.
Shown as millisecond
gcp.dataproc.cluster.job.duration.samplecount
(count)
Sample count for cluster job duration.
Shown as millisecond
gcp.dataproc.cluster.job.duration.sumsqdev
(gauge)
Sum of squared deviation for cluster job duration.
Shown as second
gcp.dataproc.cluster.job.failed_count
(count)
Indicates the number of jobs that have failed on a cluster.
Shown as job
gcp.dataproc.cluster.job.running_count
(gauge)
Indicates the number of jobs that are running on a cluster.
Shown as job
gcp.dataproc.cluster.job.submitted_count
(count)
Indicates the number of jobs that have been submitted to a cluster.
Shown as job
gcp.dataproc.cluster.nodes.expected
(gauge)
Indicates the number of nodes that are expected in a cluster.
Shown as node
gcp.dataproc.cluster.nodes.failed_count
(count)
Indicates the number of nodes that have failed in a cluster.
Shown as node
gcp.dataproc.cluster.nodes.recovered_count
(count)
Indicates the number of nodes that were detected as failed and have been successfully removed from the cluster.
Shown as node
gcp.dataproc.cluster.nodes.running
(gauge)
Indicates the number of nodes in running state.
Shown as node
gcp.dataproc.cluster.operation.completion_time.avg
(gauge)
The time operations took to complete from the time the user submits an operation to the time Dataproc reports it is completed.
Shown as millisecond
gcp.dataproc.cluster.operation.completion_time.samplecount
(count)
Sample count for cluster operation completion time.
Shown as millisecond
gcp.dataproc.cluster.operation.completion_time.sumsqdev
(gauge)
Sum of squared deviation for cluster operation completion time.
Shown as second
gcp.dataproc.cluster.operation.duration.avg
(gauge)
The time operations have spent in a given state.
Shown as millisecond
gcp.dataproc.cluster.operation.duration.samplecount
(count)
Sample count for cluster operation duration.
Shown as millisecond
gcp.dataproc.cluster.operation.duration.sumsqdev
(gauge)
Sum of squared deviation for cluster operation duration.
Shown as second
gcp.dataproc.cluster.operation.failed_count
(count)
Indicates the number of operations that have failed on a cluster.
Shown as operation
gcp.dataproc.cluster.operation.running_count
(gauge)
Indicates the number of operations that are running on a cluster.
Shown as operation
gcp.dataproc.cluster.operation.submitted_count
(count)
Indicates the number of operations that have been submitted to a cluster.
Shown as operation
gcp.dataproc.cluster.yarn.allocated_memory_percentage
(gauge)
The percentage of YARN memory that is allocated.
Shown as percent
gcp.dataproc.cluster.yarn.apps
(gauge)
Indicates the number of active YARN applications.
gcp.dataproc.cluster.yarn.containers
(gauge)
Indicates the number of YARN containers.
Shown as container
gcp.dataproc.cluster.yarn.memory_size
(gauge)
Indicates the YARN memory size in GB.
Shown as gibibyte
gcp.dataproc.cluster.yarn.nodemanagers
(gauge)
Indicates the number of YARN NodeManagers running inside a cluster.
gcp.dataproc.cluster.yarn.pending_memory_size
(gauge)
The current memory request, in GB, that is pending to be fulfilled by the scheduler.
Shown as gibibyte
gcp.dataproc.cluster.yarn.virtual_cores
(gauge)
Indicates the number of virtual cores in YARN.
Shown as core
gcp.dataproc.job.state
(gauge)
Indicates whether a job is currently in a particular state or not.
gcp.dataproc.session.spark.executors
(gauge)
Indicates the number of Session Spark executors.
Shown as worker
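
The `.avg`, `.samplecount`, and `.sumsqdev` variants of a metric describe the same distribution, so they can be combined into a standard deviation. The sketch below assumes that `sumsqdev` is the sum of squared deviations from the mean and that it is expressed in the same base unit as the matching `.avg` value; the function name and the sample values are hypothetical.

```python
import math


def distribution_stddev(sum_squared_deviation, sample_count):
    """Population standard deviation from a sumsqdev / samplecount pair."""
    if sample_count <= 0:
        # No samples were reported, so there is no spread to describe.
        return 0.0
    return math.sqrt(sum_squared_deviation / sample_count)


# Hypothetical job completion times, in milliseconds.
samples = [100.0, 200.0, 300.0]
mean = sum(samples) / len(samples)
sumsqdev = sum((x - mean) ** 2 for x in samples)

stddev = distribution_stddev(sumsqdev, len(samples))
print(f"mean={mean:.1f} ms, stddev={stddev:.2f} ms")
```

Dividing by the sample count gives the population variance. Dividing by `sample_count - 1` would give the sample variance instead, which may be preferable when only a few jobs or operations were observed.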

Events

The Google Cloud Dataproc integration does not include any events.

Service Checks

The Google Cloud Dataproc integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.