Overview
Google Cloud TPU products make the benefits of Tensor Processing Units (TPUs) available through scalable, easy-to-use cloud computing resources for ML researchers, ML engineers, developers, and data scientists running cutting-edge ML models.
Use the Datadog Google Cloud Platform integration to collect metrics from Google Cloud TPU.
Setup
Installation
If you haven’t already, set up the Google Cloud Platform integration first. There are no other installation steps.
Log collection
Google Cloud TPU logs are collected with Google Cloud Logging and sent to a Cloud Pub/Sub topic with an HTTP push forwarder. If you haven’t already, set up a Cloud Pub/Sub topic with an HTTP push forwarder.
Once this is done, export your Google Cloud TPU logs from Google Cloud Logging to the Pub/Sub topic:
- Go to the Google Cloud Logging page and filter the Google Cloud TPU logs.
- Click Create Export and name the sink.
- Choose “Cloud Pub/Sub” as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.
- Click Create and wait for the confirmation message to show up.
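The export steps above can also be scripted. The following is a minimal sketch, not an official procedure: it only assembles a `gcloud logging sinks create` command and prints it for review. The sink name, project ID, and topic name are hypothetical placeholders, and the filter `resource.type="tpu_worker"` is an assumption about how TPU logs are labeled in your project; verify it in the Logs Explorer before relying on it.

```python
import shlex

# Assumption: TPU logs carry the monitored resource type "tpu_worker".
# Confirm this in the Logs Explorer; adjust the filter if your logs differ.
TPU_LOG_FILTER = 'resource.type="tpu_worker"'


def pubsub_destination(project_id: str, topic: str) -> str:
    """Return the sink destination string for a Pub/Sub topic."""
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic}"


def sink_create_command(sink_name: str, project_id: str, topic: str,
                        log_filter: str = TPU_LOG_FILTER) -> list[str]:
    """Build the argument list for `gcloud logging sinks create`."""
    return [
        "gcloud", "logging", "sinks", "create",
        sink_name,
        pubsub_destination(project_id, topic),
        f"--log-filter={log_filter}",
    ]


if __name__ == "__main__":
    # All three names below are hypothetical placeholders.
    cmd = sink_create_command(
        "tpu-logs-to-datadog", "my-project", "export-logs-to-datadog"
    )
    # Print a shell-safe command line instead of executing it.
    print(shlex.join(cmd))
```

Because the destination string includes the project ID, the same helper covers the case where the Pub/Sub topic lives in a different project. When a sink is created from the command line, its writer identity may also need permission to publish to the topic.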
Data Collected
Metrics
| Metric | Type | Description | Unit |
| gcp.tpu.cpu.utilization | gauge | Utilization of CPUs on the TPU Worker as a percent. | percent |
| gcp.tpu.memory.usage | gauge | Memory usage in bytes. | byte |
| gcp.tpu.network.received_bytes_count | count | Cumulative bytes of data this server has received over the network. | byte |
| gcp.tpu.network.sent_bytes_count | count | Cumulative bytes of data this server has sent over the network. | byte |
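Once these metrics are flowing, they can be read back through the Datadog metrics query API. The sketch below is illustrative only: it builds, but does not send, a request to the `/api/v1/query` endpoint for one of the metrics listed above. The `API_BASE` value assumes the US1 Datadog site, the helper name `build_query_request` is hypothetical, and the keys shown are placeholders.

```python
import time
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the account is on the US1 site; other sites use a different host.
API_BASE = "https://api.datadoghq.com"


def build_query_request(metric, api_key, app_key,
                        window_seconds=3600, aggregator="avg", now=None):
    """Build (without sending) a timeseries query request for one metric."""
    end = int(now if now is not None else time.time())
    params = {
        "from": end - window_seconds,            # start of window, epoch seconds
        "to": end,                               # end of window, epoch seconds
        "query": f"{aggregator}:{metric}{{*}}",  # e.g. avg:gcp.tpu.cpu.utilization{*}
    }
    url = f"{API_BASE}/api/v1/query?{urlencode(params)}"
    headers = {"DD-API-KEY": api_key, "DD-APPLICATION-KEY": app_key}
    return Request(url, headers=headers)


if __name__ == "__main__":
    # Placeholder keys; substitute real credentials before sending.
    req = build_query_request(
        "gcp.tpu.cpu.utilization", "<API_KEY>", "<APP_KEY>"
    )
    print(req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the query. Keeping request construction separate from sending makes it easy to inspect the query string first.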
Events
The Google Cloud TPU integration does not include any events.
Service Checks
The Google Cloud TPU integration does not include any service checks.
Troubleshooting
Need help? Contact Datadog support.