Microsoft Azure Machine Learning

Overview

The Azure Machine Learning service empowers developers and data scientists with a wide range of productive experiences for building, training, and deploying machine learning models faster. Use Datadog to monitor your Azure Machine Learning performance and utilization in context with the rest of your applications and infrastructure.

Get metrics from Azure Machine Learning to:

  • Track the number and status of runs and model deployments.
  • Monitor the utilization of your machine learning nodes.
  • Optimize performance vs. cost.
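As a rough illustration of the use cases above, the sketch below builds Datadog metric query strings for this integration's metrics. The helper `build_query`, the constant `METRIC_PREFIX`, and the `name:my-workspace` tag are hypothetical names chosen for this example; only the `aggregator:metric{scope} by {tags}` query shape and the metric names come from Datadog. Check which tags your Azure integration actually attaches before scoping by them.

```python
# Hypothetical helper for composing Datadog metric queries for this integration.
METRIC_PREFIX = "azure.machinelearningservices_workspaces"


def build_query(metric, aggregator="avg", scope="*", group_by=()):
    """Return a query string of the form '<aggregator>:<metric>{<scope>} by {<tags>}'."""
    query = f"{aggregator}:{METRIC_PREFIX}.{metric}{{{scope}}}"
    if group_by:
        # Optional grouping clause, for example ' by {name}'.
        query += " by {" + ",".join(group_by) + "}"
    return query


# Failed runs summed across every workspace reporting the metric.
failed_runs_query = build_query("failed_runs", aggregator="sum")

# Active nodes for one workspace; the tag key and value here are assumptions.
active_nodes_query = build_query("active_nodes", scope="name:my-workspace")

print(failed_runs_query)
print(active_nodes_query)
```

The resulting strings can be pasted into a dashboard widget or monitor definition, or passed to whichever Datadog API client you already use.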

Setup

Installation

If you haven’t already, set up the Microsoft Azure integration first. There are no other installation steps.

Data Collected

Metrics

azure.machinelearningservices_workspaces.completed_runs
(gauge)
The number of runs completed successfully for this workspace.
Shown as operation
azure.machinelearningservices_workspaces.started_runs
(gauge)
The number of runs started for this workspace.
Shown as operation
azure.machinelearningservices_workspaces.failed_runs
(gauge)
The number of runs that failed in this workspace.
Shown as operation
azure.machinelearningservices_workspaces.model_register_succeeded
(gauge)
The number of model registrations that succeeded in this workspace.
azure.machinelearningservices_workspaces.model_register_failed
(gauge)
The number of model registrations that failed in this workspace.
azure.machinelearningservices_workspaces.model_deploy_started
(gauge)
The number of model deployments started in this workspace.
azure.machinelearningservices_workspaces.model_deploy_succeeded
(gauge)
The number of model deployments that succeeded in this workspace.
azure.machinelearningservices_workspaces.model_deploy_failed
(gauge)
The number of model deployments that failed in this workspace.
azure.machinelearningservices_workspaces.total_nodes
(gauge)
The total number of nodes. This total includes active, idle, unusable, preempted, and leaving nodes.
Shown as node
azure.machinelearningservices_workspaces.active_nodes
(gauge)
The number of active nodes. These are the nodes that are actively running a job.
Shown as node
azure.machinelearningservices_workspaces.idle_nodes
(gauge)
The number of idle nodes. Idle nodes are not running any jobs but can accept a new job if one is available.
Shown as node
azure.machinelearningservices_workspaces.unusable_nodes
(gauge)
The number of unusable nodes. Unusable nodes are not functional due to some unresolvable issue. Azure will recycle these nodes.
Shown as node
azure.machinelearningservices_workspaces.preempted_nodes
(gauge)
The number of preempted nodes. These are low-priority nodes that have been taken away from the available node pool.
Shown as node
azure.machinelearningservices_workspaces.leaving_nodes
(gauge)
The number of leaving nodes. Leaving nodes have just finished processing a job and will go to the idle state.
Shown as node
azure.machinelearningservices_workspaces.total_cores
(gauge)
The number of total cores.
Shown as core
azure.machinelearningservices_workspaces.active_cores
(gauge)
The number of active cores.
Shown as core
azure.machinelearningservices_workspaces.idle_cores
(gauge)
The number of idle cores.
Shown as core
azure.machinelearningservices_workspaces.unusable_cores
(gauge)
The number of unusable cores.
Shown as core
azure.machinelearningservices_workspaces.preempted_cores
(gauge)
The number of preempted cores.
Shown as core
azure.machinelearningservices_workspaces.leaving_cores
(gauge)
The number of leaving cores.
Shown as core
azure.machinelearningservices_workspaces.quota_utilization_percentage
(gauge)
The percent of quota utilized.
Shown as percent
azure.machinelearningservices_workspaces.cpuutilization
(gauge)
The CPU utilization.
Shown as percent
azure.machinelearningservices_workspaces.gpuutilization
(gauge)
The GPU utilization.
Shown as percent
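
The node and run gauges listed above can be combined into derived ratios. The following is a minimal sketch, not part of the integration: the function names and the sample values are made up for illustration, and the failure rate assumes that completed and failed runs together account for all finished runs, which ignores any other end states such as cancelled runs.

```python
def node_utilization(active_nodes, total_nodes):
    """Fraction of nodes actively running a job; 0.0 when no nodes are reported."""
    if total_nodes <= 0:
        return 0.0
    return active_nodes / total_nodes


def run_failure_rate(failed_runs, completed_runs):
    """Share of finished runs that failed, assuming finished = failed + completed."""
    finished = failed_runs + completed_runs
    if finished <= 0:
        return 0.0
    return failed_runs / finished


# Made-up sample readings, not real workspace data.
sample = {"active_nodes": 6, "total_nodes": 8, "failed_runs": 1, "completed_runs": 9}

print(node_utilization(sample["active_nodes"], sample["total_nodes"]))    # 0.75
print(run_failure_rate(sample["failed_runs"], sample["completed_runs"]))  # 0.1
```

A persistently low node utilization alongside a high total node count may suggest over-provisioned compute, which ties back to the cost optimization use case.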

Events

The Azure Machine Learning integration does not include any events.

Service Checks

The Azure Machine Learning integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.

Further reading

Additional helpful documentation, links, and articles: