AWS Step Functions

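Overview

AWS Step Functions lets you coordinate the components of distributed applications and microservices using visual workflows.

Enable this integration to see all of your Step Functions metrics in Datadog.

Native AWS Step Functions monitoring in Datadog is available in public beta. To instrument your Step Functions with enhanced metrics and traces, see the Serverless documentation.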

Setup

Installation

If you haven't already, set up the Amazon Web Services integration first. Then add the following permissions to the policy document for your AWS/Datadog role:

states:ListStateMachines,
states:DescribeStateMachine
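
As a rough illustration, the two permissions above could be expressed as an IAM policy document like the one built below. This is a minimal sketch: the helper name `build_policy_document` and the wildcard `Resource` are assumptions made for the example and do not come from the Datadog or AWS documentation. Narrow the resource scope to suit your account.

```python
import json

# The two permissions named in the list above.
STEP_FUNCTIONS_ACTIONS = [
    "states:ListStateMachines",
    "states:DescribeStateMachine",
]


def build_policy_document(actions):
    """Return an IAM policy document that allows the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: all resources. Restrict to specific ARNs if needed.
                "Resource": "*",
            }
        ],
    }


policy = build_policy_document(STEP_FUNCTIONS_ACTIONS)
print(json.dumps(policy, indent=2))
```

The printed JSON can be merged into the existing policy attached to the role used by the Datadog AWS integration.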

Metric collection

  1. On the AWS integration page, ensure that States is enabled under the Metric Collection tab. If your state machines use AWS Lambda, also ensure that Lambda is checked.
  2. Install the Datadog - AWS Step Functions integration.

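Augmenting AWS Lambda metrics

If your Step Functions states are Lambda functions, installing this integration adds the tags statemachinename, statemachinearn, and stepname to your Lambda metrics. These tags show which state machine each Lambda function belongs to, and you can visualize this on the Serverless page.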
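
One way to use these tags is to scope a Lambda metric to a single state machine and group it by step. The sketch below only builds a query string. The helper name and the example state machine name `order-processing` are hypothetical, and `aws.lambda.duration` is assumed to be a Lambda metric already collected in your account.

```python
def lambda_duration_by_step(state_machine_name):
    """Build a metric query for Lambda duration, filtered by state machine
    and grouped by the stepname tag added by this integration."""
    return (
        "avg:aws.lambda.duration{statemachinename:"
        + state_machine_name
        + "} by {stepname}"
    )


# Hypothetical state machine name, used only for illustration.
query = lambda_duration_by_step("order-processing")
print(query)
```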

Enhanced metric collection

Datadog can also generate metrics for your Step Functions that help you track the average or p99 duration of individual steps. To collect enhanced metrics for AWS Step Functions, you must use Datadog APM.

Log collection

  1. Configure AWS Step Functions to send logs to CloudWatch. Note: Use the default CloudWatch log group prefix /aws/vendedlogs/states so that Datadog can identify the log source and parse the logs automatically.
  2. Send the logs to Datadog.
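
Before forwarding logs, it can be worth checking that a log group name actually starts with the prefix mentioned in step 1. The following is a small sketch of such a check. The function name and the sample log group names are made up for illustration.

```python
# Prefix named in step 1 above.
STATES_LOG_GROUP_PREFIX = "/aws/vendedlogs/states"


def uses_default_prefix(log_group_name):
    """Return True if the log group name starts with the default prefix."""
    return log_group_name.startswith(STATES_LOG_GROUP_PREFIX)


# Hypothetical log group names, used only for illustration.
for name in ["/aws/vendedlogs/states/order-processing-Logs", "/custom/stepfunctions/orders"]:
    print(name, uses_default_prefix(name))
```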

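Trace collection

You can enable trace collection through Datadog APM for Step Functions or through AWS X-Ray.

Enable tracing through Datadog APM for AWS Step Functions

This feature is in public beta.
To enable distributed tracing for your AWS Step Functions, see the installation instructions in the Serverless documentation.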

Enable tracing through AWS X-Ray

This option does not collect enhanced metrics for AWS Step Functions. To collect those metrics, you must enable tracing through Datadog APM for AWS Step Functions.

To collect traces from your AWS Step Functions through AWS X-Ray:

  1. Enable the Datadog AWS X-Ray integration.
  2. Log in to the AWS console.
  3. Browse to Step Functions.
  4. Select one of your Step Functions and click Edit.
  5. Scroll to the Tracing section at the bottom of the page and check the Enable X-Ray tracing box.
  6. Recommended: Install the AWS X-Ray tracing library in your functions for more detailed traces.
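
Steps 2 through 5 use the AWS console. As an alternative, the same setting can be changed programmatically. The sketch below assumes a boto3 Step Functions client and its `update_state_machine` call with the `tracingConfiguration` parameter. The helper name and the ARN in the usage comment are placeholders, not values from this page.

```python
def enable_xray_tracing(sfn_client, state_machine_arn):
    """Turn on X-Ray tracing for one state machine.

    sfn_client is expected to be a boto3 "stepfunctions" client
    (or any object exposing the same update_state_machine method).
    """
    return sfn_client.update_state_machine(
        stateMachineArn=state_machine_arn,
        tracingConfiguration={"enabled": True},
    )


# Usage (requires boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   client = boto3.client("stepfunctions")
#   enable_xray_tracing(
#       client,
#       "arn:aws:states:us-east-1:123456789012:stateMachine:example",
#   )
```

Passing the client in as an argument keeps the helper easy to exercise with a stub object before running it against a real account.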

Data collected

Metrics

aws.states.activities_failed
(count)
The number of activities that failed.
aws.states.activities_heartbeat_timed_out
(count)
The number of activities that were timed out due to a heartbeat timeout.
aws.states.activities_scheduled
(count)
The number of activities that were scheduled.
aws.states.activities_started
(count)
The number of activities that were started.
aws.states.activities_succeeded
(count)
The number of activities that completed successfully.
aws.states.activities_timed_out
(count)
The number of activities that were timed out on close.
aws.states.activity_run_time
(gauge)
The average time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
Shown as millisecond
aws.states.activity_schedule_time
(gauge)
The average time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
Shown as millisecond
aws.states.activity_time
(gauge)
The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.activity_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
Shown as millisecond
aws.states.enhanced.execution.execution_time
(gauge)
The average execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.maximum
(gauge)
The maximum execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.minimum
(gauge)
The minimum execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.p95
(gauge)
The 95th percentile of the execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.execution_time.p99
(gauge)
The 99th percentile of the execution time of the state machine.
Shown as nanosecond
aws.states.enhanced.execution.failed
(count)
The number of state machine executions that failed.
aws.states.enhanced.execution.started
(count)
The number of state machine executions started.
aws.states.enhanced.execution.succeeded
(count)
The number of state machine executions that succeeded.
aws.states.enhanced.task.execution.task_duration
(gauge)
The average duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.maximum
(gauge)
The maximum duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.minimum
(gauge)
The minimum duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.p95
(gauge)
The 95th percentile of the duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_duration.p99
(gauge)
The 99th percentile of the duration of one task in the state machine.
Shown as nanosecond
aws.states.enhanced.task.execution.task_failed
(count)
The number of state machine task executions that failed.
aws.states.enhanced.task.execution.task_started
(count)
The number of state machine task executions started.
aws.states.enhanced.task.execution.task_succeeded
(count)
The number of state machine task executions that succeeded.
aws.states.execution_throttled
(count)
The number of StateEntered events and retries that have been throttled.
aws.states.execution_time
(gauge)
The average time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.execution_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
Shown as millisecond
aws.states.executions_aborted
(count)
The number of executions that were aborted/terminated.
aws.states.executions_failed
(count)
The number of executions that failed.
aws.states.executions_started
(count)
The number of executions started.
aws.states.executions_succeeded
(count)
The number of executions that completed successfully.
aws.states.executions_timed_out
(count)
The number of executions that timed out for any reason.
aws.states.lambda_function_run_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
Shown as millisecond
aws.states.lambda_function_schedule_time
(gauge)
The average time interval, in milliseconds, that the Lambda function stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the Lambda function stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the Lambda function stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the Lambda function stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the Lambda function stayed in the schedule state.
Shown as millisecond
aws.states.lambda_function_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_function_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
Shown as millisecond
aws.states.lambda_functions_failed
(count)
The number of lambda functions that failed.
aws.states.lambda_functions_heartbeat_timed_out
(count)
The number of lambda functions that were timed out due to a heartbeat timeout.
aws.states.lambda_functions_scheduled
(count)
The number of lambda functions that were scheduled.
aws.states.lambda_functions_started
(count)
The number of lambda functions that were started.
aws.states.lambda_functions_succeeded
(count)
The number of lambda functions that completed successfully.
aws.states.lambda_functions_timed_out
(count)
The number of lambda functions that were timed out on close.
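
Note that the list above reports the aws.states.enhanced.* duration metrics in nanoseconds, while the other duration metrics are in milliseconds. When comparing the two families outside of Datadog, a conversion along these lines may help. This is only a sketch, and the helper name and sample value are illustrative.

```python
NANOSECONDS_PER_MILLISECOND = 1_000_000


def nanoseconds_to_milliseconds(value_ns):
    """Convert a duration in nanoseconds to milliseconds."""
    return value_ns / NANOSECONDS_PER_MILLISECOND


# Illustrative value: a 2.5 second execution reported in nanoseconds.
enhanced_execution_time_ns = 2_500_000_000
print(nanoseconds_to_milliseconds(enhanced_execution_time_ns))
```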

Events

The AWS Step Functions integration does not include any events.

Service checks

The AWS Step Functions integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.