Amazon SageMaker

Overview

Amazon SageMaker is a fully managed machine learning service. With Amazon SageMaker, data scientists and developers can build and train machine learning models, and then directly deploy them into a production-ready hosted environment.

Enable this integration to see all your SageMaker metrics in Datadog.

Setup

Installation

If you haven’t already, set up the Amazon Web Services integration first.

Metric collection

  1. In the AWS integration tile, ensure that SageMaker is checked under metric collection.
  2. Install the Datadog - Amazon SageMaker integration.

Log collection

Enable logging

Configure Amazon SageMaker to send logs to either an S3 bucket or CloudWatch.

Note: If you log to an S3 bucket, make sure that amazon_sagemaker is set as the Target prefix.

Send logs to Datadog

  1. If you haven’t already, set up the Datadog log collection AWS Lambda function.

  2. Once the Lambda function is installed, manually add a trigger in the AWS console on the S3 bucket or CloudWatch log group that contains your Amazon SageMaker logs.
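The console steps above are the documented path. For readers who script their setup, the following is a minimal sketch of what the equivalent trigger settings might look like as request parameters for the AWS SDK for Python (boto3). The helper names, the example ARN, the filter name, and the choice to filter S3 events on the amazon_sagemaker prefix are assumptions made for illustration, not part of this integration.

```python
# Hypothetical helpers: they only build request parameters and call no AWS API.

def s3_trigger_config(lambda_arn, prefix="amazon_sagemaker"):
    """Build a NotificationConfiguration dict for S3's
    put_bucket_notification_configuration call.

    Assumption: only objects under the logging Target prefix should
    invoke the Lambda function.
    """
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "prefix", "Value": prefix}]
                    }
                },
            }
        ]
    }


def cloudwatch_trigger_params(log_group, lambda_arn,
                              filter_name="datadog-sagemaker"):
    """Build keyword arguments for CloudWatch Logs' put_subscription_filter.

    An empty filter pattern matches every log event in the group.
    """
    return {
        "logGroupName": log_group,
        "filterName": filter_name,
        "filterPattern": "",
        "destinationArn": lambda_arn,
    }


# Placeholder ARN for illustration only.
FORWARDER_ARN = "arn:aws:lambda:us-east-1:123456789012:function:datadog-forwarder"

s3_config = s3_trigger_config(FORWARDER_ARN)
cw_params = cloudwatch_trigger_params(
    "/aws/sagemaker/Endpoints/my-endpoint", FORWARDER_ARN
)

# With boto3 installed and credentials configured, these could be applied as:
#   boto3.client("s3").put_bucket_notification_configuration(
#       Bucket="my-log-bucket", NotificationConfiguration=s3_config)
#   boto3.client("logs").put_subscription_filter(**cw_params)
# Note that put_bucket_notification_configuration replaces the bucket's
# existing notification configuration rather than appending to it.
```

Unlike the console, a scripted setup does not grant S3 or CloudWatch Logs permission to invoke the Lambda function automatically, so that permission would need to be added separately.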

Data Collected

Metrics

aws.sagemaker.cpu_utilization
(count)
The percentage of CPU units that are used by the containers on an instance.
Shown as percent
aws.sagemaker.dataset_objects_auto_annotated
(count)
The number of dataset objects auto-annotated in a labeling job.
Shown as object
aws.sagemaker.dataset_objects_human_annotated
(count)
The number of dataset objects annotated by a human in a labeling job.
Shown as object
aws.sagemaker.dataset_objects_labeling_failed
(count)
The number of dataset objects that failed labeling in a labeling job.
Shown as object
aws.sagemaker.disk_utilization
(count)
The percentage of disk space used by the containers on an instance.
Shown as percent
aws.sagemaker.gpu_memory_utilization
(count)
The percentage of GPU memory used by the containers on an instance.
Shown as percent
aws.sagemaker.gpu_utilization
(count)
The percentage of GPU units that are used by the containers on an instance.
Shown as percent
aws.sagemaker.invocation_4xx_errors
(count)
The average number of InvokeEndpoint requests where the model returned a 4xx HTTP response code.
Shown as request
aws.sagemaker.invocation_4xx_errors.sum
(count)
The sum of the number of InvokeEndpoint requests where the model returned a 4xx HTTP response code.
Shown as request
aws.sagemaker.invocation_5xx_errors
(count)
The average number of InvokeEndpoint requests where the model returned a 5xx HTTP response code.
Shown as request
aws.sagemaker.invocation_5xx_errors.sum
(count)
The sum of the number of InvokeEndpoint requests where the model returned a 5xx HTTP response code.
Shown as request
aws.sagemaker.invocations
(count)
The sum of the number of InvokeEndpoint requests sent to a model endpoint.
Shown as request
aws.sagemaker.invocations_per_instance
(count)
The number of invocations sent to a model normalized by InstanceCount in each ProductionVariant.
aws.sagemaker.invocations.sample_count
(count)
The sample count of the number of InvokeEndpoint requests sent to a model endpoint.
Shown as request
aws.sagemaker.jobs_failed
(count)
The sum of the number of labeling jobs that failed.
Shown as job
aws.sagemaker.jobs_failed.sample_count
(count)
The sample count of the number of labeling jobs that failed.
Shown as job
aws.sagemaker.jobs_stopped
(count)
The sum of the number of labeling jobs that were stopped.
Shown as job
aws.sagemaker.jobs_stopped.sample_count
(count)
The sample count of the number of labeling jobs that were stopped.
Shown as job
aws.sagemaker.jobs_succeeded
(count)
The sum of the number of labeling jobs that succeeded.
Shown as job
aws.sagemaker.jobs_succeeded.sample_count
(count)
The sample count of the number of labeling jobs that succeeded.
Shown as job
aws.sagemaker.memory_utilization
(count)
The percentage of memory that is used by the containers on an instance.
Shown as percent
aws.sagemaker.model_cache_hit
(count)
The number of InvokeEndpoint requests sent to the multi-model endpoint for which the model was already loaded.
Shown as request
aws.sagemaker.model_downloading_time
(count)
The interval of time that it takes to download the model from Amazon Simple Storage Service (Amazon S3).
Shown as microsecond
aws.sagemaker.model_latency
(count)
The average interval of time taken by a model to respond as viewed from Amazon SageMaker.
Shown as microsecond
aws.sagemaker.model_latency.maximum
(count)
The maximum interval of time taken by a model to respond as viewed from Amazon SageMaker.
Shown as microsecond
aws.sagemaker.model_latency.minimum
(count)
The minimum interval of time taken by a model to respond as viewed from Amazon SageMaker.
Shown as microsecond
aws.sagemaker.model_latency.sample_count
(count)
The sample count of the interval of time taken by a model to respond as viewed from Amazon SageMaker.
Shown as microsecond
aws.sagemaker.model_latency.sum
(count)
The sum of the interval of time taken by a model to respond as viewed from Amazon SageMaker.
Shown as microsecond
aws.sagemaker.model_loading_time
(count)
The interval of time that it takes to load the model through the container's LoadModel API call.
Shown as microsecond
aws.sagemaker.model_loading_wait_time
(count)
The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference.
Shown as microsecond
aws.sagemaker.model_unloading_time
(count)
The interval of time that it takes to unload the model through the container's UnloadModel API call.
Shown as microsecond
aws.sagemaker.overhead_latency
(count)
The average interval of time added to the time taken to respond to a client request by Amazon SageMaker overheads.
Shown as microsecond
aws.sagemaker.overhead_latency.maximum
(count)
The maximum interval of time added to the time taken to respond to a client request by Amazon SageMaker overheads.
Shown as microsecond
aws.sagemaker.overhead_latency.minimum
(count)
The minimum interval of time added to the time taken to respond to a client request by Amazon SageMaker overheads.
Shown as microsecond
aws.sagemaker.overhead_latency.sample_count
(count)
The sample count of the interval of time added to the time taken to respond to a client request by Amazon SageMaker overheads.
Shown as microsecond
aws.sagemaker.overhead_latency.sum
(count)
The sum of the interval of time added to the time taken to respond to a client request by Amazon SageMaker overheads.
Shown as microsecond
aws.sagemaker.total_dataset_objects_labeled
(count)
The maximum number of dataset objects labeled successfully in a labeling job.
Shown as object
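Several metrics above are published as separate sum and sample_count variants and are reported in microseconds. As a rough sketch, assuming these variants follow the usual CloudWatch statistic semantics (average = sum / sample count), derived values such as mean latency in milliseconds or a 4xx error rate could be computed as follows. The function names and input numbers are hypothetical.

```python
def average_latency_ms(latency_sum_us, sample_count):
    """Mean latency in milliseconds from a .sum value (microseconds)
    and its matching .sample_count value."""
    if sample_count == 0:
        return None  # no datapoints in the interval
    return latency_sum_us / sample_count / 1000.0


def error_rate_percent(errors_sum, invocations_sum):
    """Share of InvokeEndpoint requests that returned an error, in percent."""
    if invocations_sum == 0:
        return 0.0
    return 100.0 * errors_sum / invocations_sum


# Made-up values standing in for aws.sagemaker.model_latency.sum,
# aws.sagemaker.model_latency.sample_count,
# aws.sagemaker.invocation_4xx_errors.sum, and aws.sagemaker.invocations.
print(average_latency_ms(4_500_000, 300))  # 15.0
print(error_rate_percent(3, 600))          # 0.5
```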

Events

The Amazon SageMaker integration does not include any events.

Service Checks

The Amazon SageMaker integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.