AWS Step Functions

Overview

AWS Step Functions enables you to coordinate the components of distributed applications and microservices using visual workflows.

Enable this integration to see all your Step Functions metrics in Datadog.

For native AWS Step Functions monitoring, see Datadog Serverless Monitoring for AWS Step Functions.

Setup

Installation

If you haven’t already, set up the Amazon Web Services integration first. Then, add the following permissions to the policy document for your AWS/Datadog Role:

states:ListStateMachines,
states:DescribeStateMachine

Metric collection

  1. In the AWS integration page, ensure that States is enabled under the Metric Collection tab. If your state machines use AWS Lambda, also ensure that Lambda is checked.
  2. Install the Datadog - AWS Step Functions integration.

Augmenting AWS Lambda metrics

If your Step Functions states are Lambda functions, installing this integration adds additional tags statemachinename, statemachinearn, and stepname to your Lambda metrics. This lets you see which state machines your Lambda functions belong to, and you can visualize this on the Serverless page.

Enhanced metric collection

Datadog can also generate metrics for your Step Functions to help you track the average or p99 of individual step durations. To collect enhanced metrics for AWS Step Functions, you must use Datadog APM.

Log collection

  1. Configure AWS Step Functions to send logs to CloudWatch. Note: Use the default CloudWatch log group prefix /aws/vendedlogs/states for Datadog to identify the source of the logs and parse them automatically.
  2. Send the logs to Datadog.

Trace collection

You can enable trace collection in two ways: through Datadog APM for Step Functions, or through AWS X-Ray.

Enable tracing through Datadog APM for AWS Step Functions

To enable distributed tracing for your AWS Step Functions, see the installation instructions in the Serverless documentation.

Enable tracing through AWS X-Ray

This option does not collect enhanced metrics for AWS Step Functions. For these metrics, you must enable tracing through Datadog APM for AWS Step Functions.

To collect traces from your AWS Step Functions through AWS X-Ray:

  1. Enable the Datadog AWS X-Ray integration.
  2. Log in to the AWS Console.
  3. Browse to Step Functions.
  4. Select one of your Step Functions and click Edit.
  5. Scroll to the Tracing section at the bottom of the page and check the box to Enable X-Ray tracing.
  6. Recommended: Install the AWS X-Ray tracing library in your functions for more detailed traces.

Data Collected

Metrics

aws.states.activities_failed
(count)
The number of activities that failed.
aws.states.activities_heartbeat_timed_out
(count)
The number of activities that were timed out due to a heartbeat timeout.
aws.states.activities_scheduled
(count)
The number of activities that were scheduled.
aws.states.activities_started
(count)
The number of activities that were started.
aws.states.activities_succeeded
(count)
The number of activities that completed successfully.
aws.states.activities_timed_out
(count)
The number of activities that were timed out on close.
aws.states.activity_run_time
(gauge)
The average time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_schedule_time
(gauge)
The avg time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_time
(gauge)
The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.enhanced.execution.execution_time
(gauge)
The average execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.maximum
(gauge)
The maximum execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.minimum
(gauge)
The minimum execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.p95
(gauge)
The 95th percentile of the execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.p99
(gauge)
The 99th percentile of the execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.failed
(count)
The number of state machine executions that failed.
aws.states.enhanced.execution.started
(count)
The number of state machine executions started.
aws.states.enhanced.execution.succeeded
(count)
The number of state machine executions that succeeded.
aws.states.enhanced.task.execution.task_duration
(gauge)
The average duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.maximum
(gauge)
The maximum duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.minimum
(gauge)
The minimum duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.p95
(gauge)
The 95th percentile of the duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.p99
(gauge)
The 99th percentile of the duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_failed
(count)
The number of state machine task executions that failed.
aws.states.enhanced.task.execution.task_started
(count)
The number of state machine task executions started.
aws.states.enhanced.task.execution.task_succeeded
(count)
The number of state machine task executions that succeeded.
aws.states.execution_throttled
(count)
The number of StateEntered events in addition to retries
aws.states.execution_time
(gauge)
The average time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed.il
Shown as millisecond
aws.states.executions_aborted
(count)
The number of executions that were aborted/terminated.
aws.states.executions_failed
(count)
The number of executions that failed.
aws.states.executions_started
(count)
The number of executions started.
aws.states.executions_succeeded
(count)
The number of executions that completed successfully.
aws.states.executions_timed_out
(count)
The number of executions that timed out for any reason.
aws.states.lambda_function_run_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_schedule_time
(gauge)
The avg time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_functions_failed
(count)
The number of lambda functions that failed.
aws.states.lambda_functions_heartbeat_timed_out
(count)
The number of lambda functions that were timed out due to a heartbeat timeout.
aws.states.lambda_functions_scheduled
(count)
The number of lambda functions that were scheduled.
aws.states.lambda_functions_started
(count)
The number of lambda functions that were started.
aws.states.lambda_functions_succeeded
(count)
The number of lambda functions that completed successfully.
aws.states.lambda_functions_timed_out
(count)
The number of lambda functions that were timed out on close.

Events

The AWS Step Functions integration does not include any events.

Service Checks

The AWS Step Functions integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.