Google Machine Learning

Overview

Google Cloud Machine Learning is a managed service for building machine learning models that work on data of any type and size.

Get metrics from Google Cloud Machine Learning to:

  • Visualize the performance of your ML services.
  • Correlate the performance of your ML services with your applications.

Setup

Installation

If you haven’t already, set up the Google Cloud Platform integration first. No other installation steps are required.

Log collection

Google Cloud Machine Learning logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven’t already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Cloud Machine Learning logs from Google Cloud Logging to the Pub/Sub topic:

  1. Go to the Google Cloud Logging page and filter the Google Cloud Machine Learning logs.
  2. Click Create Export and name the sink.
  3. Choose “Cloud Pub/Sub” as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
  4. Click Create and wait for the confirmation message to appear.
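
The same sink can also be described programmatically. The following is a minimal sketch, not an official procedure: it only builds the three fields a Cloud Logging sink needs (name, destination, filter) and prints them as JSON. The sink name, project ID, and topic ID are hypothetical placeholders, and the resource types `ml_job` and `cloudml_model_version` are assumptions about how Cloud Machine Learning logs are labeled; confirm them in the Cloud Logging page before relying on the filter.

```python
import json


def build_sink_body(sink_name, project_id, topic_id, resource_types):
    """Return a dict with the name, destination, and filter of a log sink."""
    # Pub/Sub sink destinations use this fixed prefix followed by the topic path.
    destination = f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"
    # Match any of the given resource types (assumed values, see the text above).
    log_filter = " OR ".join(f'resource.type="{t}"' for t in resource_types)
    return {"name": sink_name, "destination": destination, "filter": log_filter}


# All identifiers below are hypothetical placeholders.
body = build_sink_body(
    sink_name="datadog-ml-logs",
    project_id="my-project",
    topic_id="export-logs-to-datadog",
    resource_types=["ml_job", "cloudml_model_version"],
)
print(json.dumps(body, indent=2))
```

Because the destination embeds the project ID of the topic, a topic in a different project only changes the `project_id` argument. The sink's writer identity typically also needs permission to publish to that topic.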

Data Collected

Metrics

gcp.ml.prediction.error_count
(count)
Cumulative count of prediction errors.
gcp.ml.prediction.latencies.avg
(count)
The average latency of a certain type.
Shown as microsecond
gcp.ml.prediction.latencies.samplecount
(count)
The sample count for latency of a certain type.
Shown as microsecond
gcp.ml.prediction.latencies.sumsqdev
(count)
The sum of squared deviation for latency of a certain type.
Shown as microsecond
gcp.ml.prediction.online.accelerator.duty_cycle
(gauge)
Average fraction of time over the past sample period during which the accelerator(s) were actively processing.
gcp.ml.prediction.online.accelerator.memory.bytes_used
(gauge)
Amount of accelerator memory allocated by the model replica.
Shown as byte
gcp.ml.prediction.online.cpu.utilization
(gauge)
Fraction of CPU allocated by the model replica and currently in use. May exceed 100% if the machine type has multiple CPUs.
gcp.ml.prediction.online.memory.bytes_used
(gauge)
Amount of memory allocated by the model replica and currently in use.
Shown as byte
gcp.ml.prediction.online.network.bytes_received
(count)
Number of bytes received over the network by the model replica.
Shown as byte
gcp.ml.prediction.online.network.bytes_sent
(count)
Number of bytes sent over the network by the model replica.
Shown as byte
gcp.ml.prediction.online.replicas
(gauge)
Number of active model replicas.
gcp.ml.prediction.online.target_replicas
(gauge)
Target (desired) number of active model replicas.
gcp.ml.prediction.prediction_count
(count)
Cumulative count of predictions.
gcp.ml.prediction.response_count
(count)
Cumulative count of different response codes.
gcp.ml.training.accelerator.memory.utilization
(gauge)
Fraction of allocated accelerator memory that is currently in use. Values are numbers between 0.0 and 1.0; charts display the values as a percentage between 0% and 100%.
gcp.ml.training.accelerator.utilization
(gauge)
Fraction of allocated accelerator that is currently in use. Values are numbers between 0.0 and 1.0; charts display the values as a percentage between 0% and 100%.
gcp.ml.training.cpu.utilization
(gauge)
Fraction of allocated CPU that is currently in use. Values are numbers between 0.0 and 1.0; charts display the values as a percentage between 0% and 100%.
gcp.ml.training.memory.utilization
(gauge)
Fraction of allocated memory that is currently in use. Values are numbers between 0.0 and 1.0; charts display the values as a percentage between 0% and 100%.
gcp.ml.training.network.received_bytes_count
(count)
Number of bytes received by the training job over the network.
Shown as byte
gcp.ml.training.network.sent_bytes_count
(count)
Number of bytes sent by the training job over the network.
Shown as byte
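
The three gcp.ml.prediction.latencies.* metrics can be combined to estimate how widely latency varies. The sketch below assumes that samplecount and sumsqdev describe the same time window, and that sumsqdev is the sum of squared deviations from the mean, expressed in squared microseconds. Under those assumptions, the standard deviation is the square root of sumsqdev divided by samplecount. The input values are made up for illustration.

```python
import math


def latency_stddev_us(sample_count, sum_sq_dev):
    """Population standard deviation of latency, in microseconds."""
    if sample_count <= 0:
        # No samples in the window: report zero spread instead of dividing by zero.
        return 0.0
    return math.sqrt(sum_sq_dev / sample_count)


# Hypothetical readings for one time window.
avg_us = 42_000.0     # gcp.ml.prediction.latencies.avg
sample_count = 200    # gcp.ml.prediction.latencies.samplecount
sum_sq_dev = 1.8e9    # gcp.ml.prediction.latencies.sumsqdev

stddev_us = latency_stddev_us(sample_count, sum_sq_dev)
print(f"avg = {avg_us / 1000:.1f} ms, stddev = {stddev_us / 1000:.1f} ms")
```

With these made-up inputs the variance is 1.8e9 / 200 = 9,000,000 squared microseconds, so the standard deviation is 3,000 microseconds (3.0 ms) around a 42.0 ms average.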

Events

The Google Cloud Machine Learning integration does not include any events.

Service Checks

The Google Cloud Machine Learning integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.
