---
title: Google Cloud Dataproc
description: >-
  A managed cloud service for cost-effective operation of Apache Spark and
  Hadoop clusters.
breadcrumbs: Docs > Integrations > Google Cloud Dataproc
---

# Google Cloud Dataproc

## Overview{% #overview %}

{% alert level="info" %}
[Data Observability: Jobs Monitoring](https://docs.datadoghq.com/data_jobs/) helps you observe, troubleshoot, and cost-optimize your Spark jobs on your Dataproc clusters.
{% /alert %}

Google Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.

Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud Dataproc.

## Setup{% #setup %}

### Installation{% #installation %}

If you haven't already, set up the [Google Cloud Platform integration](https://docs.datadoghq.com/integrations/google-cloud-platform/) first. There are no other installation steps.

### Log collection{% #log-collection %}

Google Cloud Dataproc logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, [set up logging with the Datadog Dataflow template](https://docs.datadoghq.com/integrations/google-cloud-platform/#log-collection).

Once this is done, export your Google Cloud Dataproc logs from Google Cloud Logging to the Pub/Sub topic:

1. Go to the [Google Cloud Logging page](https://console.cloud.google.com/logs/viewer) and filter the Google Cloud Dataproc logs.
1. Click **Create Export** and name the sink.
1. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic that was created for that purpose. **Note**: The Pub/Sub topic can be located in a different project.
1. Click **Create** and wait for the confirmation message to show up.
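The export steps above can also be scripted. The sketch below uses the `google-cloud-logging` Python client to create an equivalent sink; the sink name, project ID, and topic name are placeholders, and the exact client usage should be verified against the current library documentation.

```python
# Sketch: create a Cloud Logging sink that routes Dataproc logs to a
# Pub/Sub topic. The sink name, project ID, and topic are placeholders.


def dataproc_log_filter(cluster_name=None):
    """Build a Cloud Logging filter that matches Dataproc cluster logs."""
    log_filter = 'resource.type="cloud_dataproc_cluster"'
    if cluster_name:
        # Optionally narrow the filter to a single cluster.
        log_filter += f' AND resource.labels.cluster_name="{cluster_name}"'
    return log_filter


def pubsub_destination(project_id, topic):
    """Build the sink destination URI for a Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic}"


if __name__ == "__main__":
    # Requires `pip install google-cloud-logging` and application credentials.
    from google.cloud import logging as gcp_logging

    client = gcp_logging.Client()
    sink = client.sink(
        "export-dataproc-logs",  # placeholder sink name
        filter_=dataproc_log_filter(),
        destination=pubsub_destination("my-project", "my-datadog-topic"),
    )
    if not sink.exists():
        sink.create()
```

As with the console flow, the service account the sink writes as must have publish permission on the destination topic.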

## Data Collected{% #data-collected %}

### Metrics{% #metrics %}

| Metric | Description |
| --- | --- |
| **gcp.dataproc.batch.spark.executors**(gauge)                          | Indicates the number of Batch Spark executors.*Shown as worker*                                                                                      |
| **gcp.dataproc.cluster.capacity\_deviation**(gauge)                    | Difference between the expected node count in the cluster and the actual active YARN node managers.                                                  |
| **gcp.dataproc.cluster.hdfs.datanodes**(gauge)                         | Indicates the number of HDFS DataNodes that are running inside a cluster.*Shown as node*                                                             |
| **gcp.dataproc.cluster.hdfs.storage\_capacity**(gauge)                 | Indicates the capacity of the HDFS system running on a cluster, in GB.*Shown as gibibyte*                                                            |
| **gcp.dataproc.cluster.hdfs.storage\_utilization**(gauge)              | The percentage of HDFS storage currently used.*Shown as percent*                                                                                     |
| **gcp.dataproc.cluster.hdfs.unhealthy\_blocks**(gauge)                 | Indicates the number of unhealthy blocks inside the cluster.*Shown as block*                                                                         |
| **gcp.dataproc.cluster.job.completion\_time.avg**(gauge)               | The time jobs took to complete from the time the user submits a job to the time Dataproc reports it is completed.*Shown as millisecond*              |
| **gcp.dataproc.cluster.job.completion\_time.samplecount**(count)       | Sample count for cluster job completion time.*Shown as millisecond*                                                                                  |
| **gcp.dataproc.cluster.job.completion\_time.sumsqdev**(gauge)          | Sum of squared deviation for cluster job completion time.*Shown as second*                                                                           |
| **gcp.dataproc.cluster.job.duration.avg**(gauge)                       | The time jobs have spent in a given state.*Shown as millisecond*                                                                                     |
| **gcp.dataproc.cluster.job.duration.samplecount**(count)               | Sample count for cluster job duration.*Shown as millisecond*                                                                                         |
| **gcp.dataproc.cluster.job.duration.sumsqdev**(gauge)                  | Sum of squared deviation for cluster job duration.*Shown as second*                                                                                  |
| **gcp.dataproc.cluster.job.failed\_count**(count)                      | Indicates the number of jobs that have failed on a cluster.*Shown as job*                                                                            |
| **gcp.dataproc.cluster.job.running\_count**(gauge)                     | Indicates the number of jobs that are running on a cluster.*Shown as job*                                                                            |
| **gcp.dataproc.cluster.job.submitted\_count**(count)                   | Indicates the number of jobs that have been submitted to a cluster.*Shown as job*                                                                    |
| **gcp.dataproc.cluster.mig\_instances.failed\_count**(count)           | Indicates the number of instance failures for a managed instance group.                                                                              |
| **gcp.dataproc.cluster.nodes.expected**(gauge)                         | Indicates the number of nodes that are expected in a cluster.*Shown as node*                                                                         |
| **gcp.dataproc.cluster.nodes.failed\_count**(count)                    | Indicates the number of nodes that have failed in a cluster.*Shown as node*                                                                          |
| **gcp.dataproc.cluster.nodes.recovered\_count**(count)                 | Indicates the number of nodes that are detected as failed and have been successfully removed from cluster.*Shown as node*                            |
| **gcp.dataproc.cluster.nodes.running**(gauge)                          | Indicates the number of nodes in running state.*Shown as node*                                                                                       |
| **gcp.dataproc.cluster.operation.completion\_time.avg**(gauge)         | The time operations took to complete from the time the user submits an operation to the time Dataproc reports it is completed.*Shown as millisecond* |
| **gcp.dataproc.cluster.operation.completion\_time.samplecount**(count) | Sample count for cluster operation completion time.*Shown as millisecond*                                                                            |
| **gcp.dataproc.cluster.operation.completion\_time.sumsqdev**(gauge)    | Sum of squared deviation for cluster operation completion time.*Shown as second*                                                                     |
| **gcp.dataproc.cluster.operation.duration.avg**(gauge)                 | The time operations have spent in a given state.*Shown as millisecond*                                                                               |
| **gcp.dataproc.cluster.operation.duration.samplecount**(count)         | Sample count for cluster operation duration.*Shown as millisecond*                                                                                   |
| **gcp.dataproc.cluster.operation.duration.sumsqdev**(gauge)            | Sum of squared deviation for cluster operation duration.*Shown as second*                                                                            |
| **gcp.dataproc.cluster.operation.failed\_count**(count)                | Indicates the number of operations that have failed on a cluster.*Shown as operation*                                                                |
| **gcp.dataproc.cluster.operation.running\_count**(gauge)               | Indicates the number of operations that are running on a cluster.*Shown as operation*                                                                |
| **gcp.dataproc.cluster.operation.submitted\_count**(count)             | Indicates the number of operations that have been submitted to a cluster.*Shown as operation*                                                        |
| **gcp.dataproc.cluster.yarn.allocated\_memory\_percentage**(gauge)     | The percentage of YARN memory that is allocated.*Shown as percent*                                                                                   |
| **gcp.dataproc.cluster.yarn.apps**(gauge)                              | Indicates the number of active YARN applications.                                                                                                    |
| **gcp.dataproc.cluster.yarn.containers**(gauge)                        | Indicates the number of YARN containers.*Shown as container*                                                                                         |
| **gcp.dataproc.cluster.yarn.memory\_size**(gauge)                      | Indicates the YARN memory size in GB.*Shown as gibibyte*                                                                                             |
| **gcp.dataproc.cluster.yarn.nodemanagers**(gauge)                      | Indicates the number of YARN NodeManagers running inside a cluster.                                                                                  |
| **gcp.dataproc.cluster.yarn.pending\_memory\_size**(gauge)             | The current memory request, in GB, that is pending to be fulfilled by the scheduler.*Shown as gibibyte*                                              |
| **gcp.dataproc.cluster.yarn.virtual\_cores**(gauge)                    | Indicates the number of virtual cores in YARN.*Shown as core*                                                                                        |
| **gcp.dataproc.job.state**(gauge)                                      | Indicates whether job is currently in a particular state or not.                                                                                     |
| **gcp.dataproc.job.yarn.memory\_seconds**(gauge)                       | Indicates the Memory Seconds consumed by the `job_id` job per yarn `application_id`.                                                                 |
| **gcp.dataproc.job.yarn.vcore\_seconds**(gauge)                        | Indicates the VCore Seconds consumed by the `job_id` job per yarn `application_id`.                                                                  |
| **gcp.dataproc.node.problem\_count**(count)                            | Total number of times a specific type of problem has occurred.                                                                                       |
| **gcp.dataproc.node.yarn.nodemanager.health**(gauge)                   | YARN NodeManager health state.                                                                                                                       |
| **gcp.dataproc.session.spark.executors**(gauge)                        | Indicates the number of Session Spark executors.*Shown as worker*                                                                                    |
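As a quick check that these metrics are flowing into Datadog, you can query any metric from the table above through Datadog's v1 metric query API. The sketch below uses only the Python standard library; the chosen metric and scope are examples, the `DD_API_KEY` and `DD_APP_KEY` environment variable names are assumptions for this sketch, and the `api.datadoghq.com` host assumes the US1 Datadog site.

```python
import os
import time
from urllib.parse import urlencode


def build_query_params(query, window_seconds=3600, now=None):
    """Build the query-string parameters for Datadog's v1 metric query endpoint."""
    now = int(now if now is not None else time.time())
    # The endpoint takes Unix-epoch `from`/`to` bounds and a metric query string.
    return {"from": now - window_seconds, "to": now, "query": query}


if __name__ == "__main__":
    import json
    import urllib.request

    # Example: average active YARN NodeManagers across all Dataproc clusters
    # over the last hour.
    params = build_query_params("avg:gcp.dataproc.cluster.yarn.nodemanagers{*}")
    url = "https://api.datadoghq.com/api/v1/query?" + urlencode(params)
    req = urllib.request.Request(url, headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    })
    with urllib.request.urlopen(req) as resp:
        series = json.load(resp).get("series", [])
    for s in series:
        # Print each series' scope and its most recent (timestamp, value) point.
        print(s["scope"], s["pointlist"][-1])
```

An empty `series` list for a metric that should be reporting usually means the Google Cloud Platform integration is not yet collecting from the project that hosts the cluster.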

### Events{% #events %}

The Google Cloud Dataproc integration does not include any events.

### Service Checks{% #service-checks %}

The Google Cloud Dataproc integration does not include any service checks.

## Troubleshooting{% #troubleshooting %}

Need help? Contact [Datadog support](https://docs.datadoghq.com/help/).
