Datadog-AWS Integration

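Overview of the AWS Integration

Connect Datadog to Amazon Web Services (AWS) to collect data for the following purposes:

  • Show AWS status updates automatically in the Datadog event stream.
  • Collect CloudWatch metrics from EC2 hosts, even when the Agent is not installed.
  • Automatically tag EC2 hosts with EC2-specific information (for example, “availability zone”).
  • Show EC2 scheduled maintenance events in the Datadog event stream.
  • Collect CloudWatch metrics and events from many other AWS services.

The AWS integration also includes integrations for the related AWS services listed below (most of them need no separate configuration):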

API Gateway: create, publish, maintain, and secure APIs
Autoscaling: scale EC2 capacity
Billing: billing and budgets
CloudFront: global content delivery network
CloudTrail: access to log files and AWS API calls
CloudSearch: managed search service
Direct Connect: dedicated network connection to AWS
DynamoDB: NoSQL database
EC2 Container Service (ECS): container management service that supports Docker containers
Elastic Beanstalk: easy-to-use service for deploying and scaling web applications and services
Elastic Block Store (EBS): persistent block-level storage volumes
ElastiCache: in-memory cache in the cloud
Elastic Compute Cloud (EC2): resizable compute capacity in the cloud
Elastic File System (EFS): shared file storage
Elastic Load Balancing (ELB): distributes incoming application traffic across multiple Amazon EC2 instances
Elastic MapReduce (EMR): data processing using Hadoop
Elasticsearch Service (ES): deploy, operate, and scale Elasticsearch clusters
Firehose: capture and load streaming data
IoT: connect IoT devices with cloud services
Kinesis: service for real-time processing of large, distributed data streams
Key Management Service (KMS): create and control encryption keys
Lambda: serverless computing
Machine Learning (ML): create machine learning models
OpsWorks: configuration management
Polly: text-to-speech service
Redshift: data warehouse solution
Relational Database Service (RDS): relational database in the cloud
Route 53: DNS and traffic management with availability monitoring
Simple Email Service (SES): cost-effective, outbound-only email-sending service
Simple Notification Service (SNS): alerts and notifications
Simple Queue Service (SQS): messaging queue service
Simple Storage Service (S3): highly available and scalable cloud storage service
Simple Workflow Service (SWF): cloud workflow management
Storage Gateway: hybrid cloud storage
Web Application Firewall (WAF): protect web applications from common web exploits
Workspaces: secure desktop computing service

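Setting Up the AWS Integration

Creating the IAM Policy and IAM Role

To set up the integration for Amazon Web Services, you must configure role delegation using AWS IAM. To better understand role delegation, refer to the IAM best practices published by AWS.

Note: GovCloud and China regions do not currently support AWS IAM role delegation. If you are setting up the integration for these regions, skip to the section on the China and GovCloud regions below.

  1. First, go to the IAM Console and create a new policy. Name the policy DatadogAWSIntegrationPolicy, or choose any other name you prefer. To take advantage of every AWS integration offered by Datadog, use the policy document JSON below. Note that the contents of the policy document may change as new components are added to the AWS integration.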
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:Describe*",
        "budgets:ViewBudget",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "codedeploy:List*",
        "codedeploy:BatchGet*",
        "dynamodb:List*",
        "dynamodb:Describe*",
        "ec2:Describe*",
        "ec2:Get*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeTags",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "elasticmapreduce:Describe*",
        "es:ListTags",
        "es:ListDomainNames",
        "es:DescribeElasticsearchDomains",
        "kinesis:List*",
        "kinesis:Describe*",
        "lambda:List*",
        "logs:Get*",
        "logs:Describe*",
        "logs:FilterLogEvents",
        "logs:TestMetricFilter",
        "rds:Describe*",
        "rds:List*",
        "route53:List*",
        "s3:GetBucketTagging",
        "s3:ListAllMyBuckets",
        "ses:Get*",
        "sns:List*",
        "sns:Publish",
        "sqs:ListQueues",
        "support:*",
        "tag:getResources",
        "tag:getTagKeys",
        "tag:getTagValues"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

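If you are not comfortable granting all of these permissions, grant at least the existing AWS managed policies AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For details on the permissions each AWS service integration requires, see the section on required permissions below.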

  1. Create a new role in the IAM Console. Give the role a name such as DatadogAWSIntegrationRole.
  2. On the “Select Role Type” page, choose the role for cross-account access.
  3. Click the [Select] button to the right of “Provide access between your AWS account and a 3rd party AWS account”.
  4. For “Account ID”, enter 464622532012 (Datadog’s account ID). Entering this account ID grants Datadog read-only access to the data AWS provides. For “External ID”, enter the “AWS External ID” shown in the Datadog AWS integration tile. Leave the MFA requirement disabled. For details on external IDs, refer to the IAM User Guide.
  5. Select the policy you created above (for example, DatadogAWSIntegrationPolicy).
  6. Review your selections and click the Create Role button.
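
The console steps above leave the role with a trust policy that names Datadog's account and requires the external ID. As a rough sketch, assuming the standard cross-account trust policy shape (the console generates the real document, and the external ID passed in below is a placeholder, not a real value), it could be built like this:

```python
import json

# Datadog's AWS account ID, as given in the role setup steps.
DATADOG_ACCOUNT_ID = "464622532012"


def build_trust_policy(external_id: str) -> dict:
    """Return a trust policy that lets Datadog's account assume the role,
    but only when the caller presents the given external ID."""
    if not external_id:
        raise ValueError("external_id must not be empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{DATADOG_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                # The external ID condition is what ties the role to your
                # Datadog organization rather than to any Datadog customer.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # "example-external-id" is a hypothetical placeholder.
    print(json.dumps(build_trust_policy("example-external-id"), indent=2))
```

Comparing this output with the trust relationship shown on the role's page in the IAM Console is one way to confirm that the account ID and external ID were entered correctly.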

Installing the AWS Integration

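  1. Open the AWS integration tile in Datadog.
  2. Select the Role Delegation tab.
  3. Enter your AWS account ID without hyphens (for example, 123456789012 rather than 1234-5678-9012). The account ID appears in the ARN of the newly created role. Then enter the name of that role. Finally, enter the external ID noted in the previous steps.
  4. In the list of namespaces on the left side of the dialog, confirm that the services you want to collect metrics from are selected. Optionally, you can set tags to apply to all hosts and metrics. To monitor only a subset of the EC2 instances in your AWS account, tag those instances in AWS and specify that tag in the “Limit metrics collection” box of the dialog.
  5. Click Install Integration.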
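
Step 3 asks for the account ID without hyphens and notes that it appears in the role ARN. A minimal sketch of both checks follows. The helper names are hypothetical, and the ARN layout assumed is arn:<partition>:iam::<account-id>:role/<role-name>.

```python
import re
from typing import Tuple


def normalize_account_id(raw: str) -> str:
    """Strip surrounding whitespace and hyphens, then require exactly 12 digits."""
    digits = raw.strip().replace("-", "")
    if not re.fullmatch(r"[0-9]{12}", digits):
        raise ValueError(f"not a 12-digit AWS account ID: {raw!r}")
    return digits


def split_role_arn(arn: str) -> Tuple[str, str]:
    """Return (account_id, role_name) extracted from an IAM role ARN."""
    match = re.fullmatch(r"arn:[a-z-]+:iam::([0-9]{12}):role/(.+)", arn)
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    print(normalize_account_id("1234-5678-9012"))
    print(split_role_arn("arn:aws:iam::123456789012:role/DatadogAWSIntegrationRole"))
```

If the role was created with a path, the second element returned by split_role_arn includes that path; this sketch does not attempt to separate it from the role name.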

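Using the China or GovCloud Regions

  1. Open the AWS integration tile in Datadog.
  2. Select the Access Keys (GovCloud or China Only) tab.
  3. Enter your AWS access key and AWS secret key. Only access keys and secret keys for the China or GovCloud regions are accepted here.
  4. In the list of namespaces on the left side of the dialog, confirm that the services you want to collect metrics from are selected. Optionally, you can set tags to apply to all hosts and metrics. To monitor only a subset of the EC2 instances in your AWS account, tag those instances in AWS and specify that tag in the “Limit metrics collection” box of the dialog.
  5. Click Install Integration.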

Data Collected by the AWS Integration

Metrics (metrics for EC2, RDS, and other services are listed on their respective integration guide pages)

aws.advisor.service_limit.max
(gauge)
Max usage of aws resources
shown as service
aws.advisor.service_limit.usage
(gauge)
Current usage of aws resources
shown as service
aws.advisor.service_limit.usage_ratio
(gauge)
The percentage of resource utilization against a service limit.
shown as percent
aws.appsync.latency
(gauge)
The average time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn't include the network latency encountered for a response to reach the end devices.
shown as millisecond
aws.appsync.latency.p90
(gauge)
The 90th percentile time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn't include the network latency encountered for a response to reach the end devices.
shown as millisecond
aws.appsync.latency.maximum
(gauge)
The maximum time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn't include the network latency encountered for a response to reach the end devices.
shown as millisecond
aws.appsync.4xxerror
(count)
The number of errors captured as a result of invalid requests due to incorrect client configuration.
shown as error
aws.appsync.5xxerror
(count)
Errors encountered during the execution of a GraphQL query.
shown as error
aws.logs.incoming_bytes
(gauge)
The volume of log events in uncompressed bytes uploaded to Cloudwatch Logs.
shown as byte
aws.logs.incoming_log_events
(count)
The number of log events uploaded to Cloudwatch Logs.
shown as event
aws.logs.forwarded_bytes
(gauge)
The volume of log events in compressed bytes forwarded to the subscription destination.
shown as byte
aws.logs.forwarded_log_events
(count)
The number of log events forwarded to the subscription destination.
shown as event
aws.logs.delivery_errors
(count)
The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination.
shown as event
aws.logs.delivery_throttling
(count)
The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination.
shown as event
aws.ec2spot.available_instance_pools_count
(count)
The Spot Instance pools specified in the Spot Fleet request.
shown as instance
aws.ec2spot.bids_submitted_for_capacity
(count)
The capacity for which Amazon EC2 has submitted bids.
shown as instance
aws.ec2spot.eligible_instance_pool_count
(count)
The Spot Instance pools specified in the Spot Fleet request where Amazon EC2 can fulfill bids.
shown as instance
aws.ec2spot.fulfilled_capacity
(count)
The capacity that Amazon EC2 has fulfilled.
shown as instance
aws.ec2spot.max_percent_capacity_allocation
(gauge)
The maximum value of PercentCapacityAllocation across all Spot Instance pools specified in the Spot Fleet request.
shown as percent
aws.ec2spot.pending_capacity
(count)
The difference between TargetCapacity and FulfilledCapacity.
shown as instance
aws.ec2spot.percent_capacity_allocation
(gauge)
The capacity allocated for the Spot Instance pool for the specified dimensions.
shown as percent
aws.ec2spot.target_capacity
(count)
The target capacity of the Spot Fleet request.
shown as instance
aws.ec2spot.terminating_capacity
(count)
The capacity that is being terminated due to Spot Instance interruptions.
shown as instance
aws.ddosprotection.ddo_sattack_bits_per_second
(gauge)
The number of bytes observed during a DDoS event for a particular Amazon Resource Name (ARN).
shown as byte
aws.ddosprotection.ddo_sattack_requests_per_second
(gauge)
The number of requests observed during a DDoS event for a particular Amazon Resource Name (ARN).
shown as request
aws.ddosprotection.ddo_sdetected
(gauge)
Indicates a DDoS event for a particular Amazon Resource Name (ARN).
shown as
aws.dms.cpuutilization
(gauge)
Average percentage of allocated EC2 compute units that are currently in use on the instance.
shown as
aws.dms.free_storage_space
(gauge)
The amount of available storage space
shown as byte
aws.dms.freeable_memory
(gauge)
The amount of available random access memory.
shown as byte
aws.dms.write_iops
(gauge)
The average number of disk I/O operations per second
shown as operation
aws.dms.read_iops
(gauge)
The average number of disk I/O operations per second.
shown as operation
aws.dms.write_throughput
(gauge)
The average number of bytes written to disk per second.
shown as byte
aws.dms.read_throughput
(gauge)
The average number of bytes read from disk per second.
shown as byte
aws.dms.write_latency
(gauge)
The average amount of time taken per write disk I/O operation
shown as second
aws.dms.read_latency
(gauge)
The average amount of time taken per read disk I/O operation
shown as second
aws.dms.swap_usage
(gauge)
The amount of swap space used on the DB Instance
shown as byte
aws.dms.network_transmit_throughput
(gauge)
The outgoing (Transmit) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication
shown as byte
aws.dms.network_receive_throughput
(gauge)
The incoming (Receive) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication.
shown as byte
aws.dms.full_load_throughput_bandwidth_source
(gauge)
Incoming network bandwidth from a full load from the source
shown as kibibyte
aws.dms.full_load_throughput_bandwidth_target
(gauge)
Outgoing network bandwidth from a full load for the target
shown as kibibyte
aws.dms.full_load_throughput_rows_source
(gauge)
Incoming changes from a full load from the source in rows per second
shown as row
aws.dms.full_load_throughput_rows_target
(gauge)
Outgoing changes from a full load for the target
shown as row
aws.dms.cdcincoming_changes
(gauge)
Total row count of changes for the task
shown as row
aws.dms.cdcchanges_memory_source
(gauge)
Amount of rows accumulating in a memory and waiting to be committed from the source
shown as row
aws.dms.cdcchanges_memory_target
(gauge)
Amount of rows accumulating in a memory and waiting to be committed to the target
shown as row
aws.dms.cdcchanges_disk_source
(gauge)
Amount of rows accumulating on disk and waiting to be committed from the source
shown as row
aws.dms.cdcchanges_disk_target
(gauge)
Amount of rows accumulating on disk and waiting to be committed to the target
shown as row
aws.dms.cdcthroughput_bandwidth_source
(gauge)
Incoming task network bandwidth from the source
shown as kibibyte
aws.dms.cdcthroughput_bandwidth_target
(gauge)
Outgoing task network bandwidth for the target
shown as kibibyte
aws.dms.cdcthroughput_rows_source
(gauge)
Incoming task changes from the source
shown as row
aws.dms.cdcthroughput_rows_target
(gauge)
Outgoing task changes for the target
shown as row
aws.dms.cdclatency_source
(gauge)
Latency reading from source
shown as second
aws.dms.cdclatency_target
(gauge)
Latency writing to the target
shown as second
aws.events.invocations
(count)
Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations but does not include throttled or retried attempts until they fail permanently.
shown as
aws.events.failed_invocations
(count)
Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt
shown as
aws.events.triggered_rules
(count)
Measures the number of triggered rules that matched with any event.
shown as
aws.events.matched_events
(count)
Measures the number of events that matched with any rule.
shown as
aws.events.throttled_rules
(count)
Measures the number of triggered rules that are being throttled.
shown as
aws.natgateway.active_connection_count
(count)
The count of concurrent active TCP connections through the NAT gateway.
shown as connection
aws.natgateway.bytes_in_from_destination
(count)
The number of bytes received by the NAT Gateway from the destination.
shown as byte
aws.natgateway.bytes_in_from_source
(count)
The number of bytes received by the NAT Gateway from the VPC clients.
shown as byte
aws.natgateway.bytes_out_to_destination
(count)
The number of bytes sent through the NAT Gateway to the destination.
shown as byte
aws.natgateway.bytes_out_to_source
(count)
The number of bytes sent through the NAT Gateway to the VPC clients.
shown as byte
aws.natgateway.connection_attempt_count
(count)
The count of connections attempted through the NAT Gateway.
shown as attempt
aws.natgateway.connection_established_count
(count)
The count of connections established through the NAT Gateway.
shown as connection
aws.natgateway.error_port_allocation
(count)
The count of times a source port could not be allocated by the NAT Gateway.
shown as error
aws.natgateway.idle_timeout_count
(count)
The count of timeouts caused by connections going from active to idle state.
shown as timeout
aws.natgateway.packets_drop_count
(count)
The count of packets dropped by the NAT Gateway.
shown as packet
aws.natgateway.packets_in_from_destination
(count)
The number of packets received by the NAT Gateway from the destination.
shown as packet
aws.natgateway.packets_in_from_source
(count)
The number of packets received by the NAT Gateway from the VPC clients.
shown as packet
aws.natgateway.packets_out_to_destination
(count)
The number of packets sent through the NAT Gateway to the destination.
shown as packet
aws.natgateway.packets_out_to_source
(count)
The number of packets sent through the NAT Gateway to the VPC clients.
shown as packet
aws.states.execution_time
(gauge)
The average time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.executions_aborted
(count)
The number of executions that were aborted/terminated.
shown as
aws.states.execution_throttled
(count)
The number of StateEntered events in addition to retries
shown as
aws.states.executions_failed
(count)
The number of executions that failed.
shown as
aws.states.executions_started
(count)
The number of executions started.
shown as
aws.states.executions_succeeded
(count)
The number of executions that completed successfully.
shown as
aws.states.executions_timed_out
(count)
The number of executions that timed out for any reason.
shown as
aws.states.lambda_function_run_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_schedule_time
(gauge)
The average time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_functions_failed
(count)
The number of lambda functions that failed.
shown as
aws.states.lambda_functions_heartbeat_timed_out
(count)
The number of lambda functions that were timed out due to a heartbeat timeout.
shown as
aws.states.lambda_functions_scheduled
(count)
The number of lambda functions that were scheduled.
shown as
aws.states.lambda_functions_started
(count)
The number of lambda functions that were started.
shown as
aws.states.lambda_functions_succeeded
(count)
The number of lambda functions that completed successfully.
shown as
aws.states.lambda_functions_timed_out
(count)
The number of lambda functions that were timed out on close.
shown as
aws.states.activity_run_time
(gauge)
The average time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_schedule_time
(gauge)
The average time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_time
(gauge)
The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activities_failed
(count)
The number of activities that failed.
shown as
aws.states.activities_heartbeat_timed_out
(count)
The number of activities that were timed out due to a heartbeat timeout.
shown as
aws.states.activities_scheduled
(count)
The number of activities that were scheduled.
shown as
aws.states.activities_started
(count)
The number of activities that were started.
shown as
aws.states.activities_succeeded
(count)
The number of activities that completed successfully.
shown as
aws.states.activities_timed_out
(count)
The number of activities that were timed out on close.
shown as
aws.trustedadvisor.green_checks
(gauge)
The number of Trusted Advisor checks in a green (OK) state.
shown as check
aws.trustedadvisor.yellow_checks
(gauge)
The number of Trusted Advisor checks in a yellow (WARN) state.
shown as check
aws.trustedadvisor.red_checks
(gauge)
The number of Trusted Advisor checks in a red (ERROR) state.
shown as check
aws.trustedadvisor.service_limit_usage
(gauge)
The percentage of resource utilization against a service limit.
shown as percent
aws.vpn.tunnel_data_in
(count)
The number of bytes that have come in through the VPN tunnel
shown as byte
aws.vpn.tunnel_data_out
(count)
The number of bytes that have gone out through the VPN tunnel
shown as byte
aws.vpn.tunnel_state
(gauge)
This metric is 1 when the VPN tunnel is up and 0 when it is down
shown as

Required Permissions

The Datadog AWS integration collects data primarily from AWS CloudWatch. At a minimum, your policy document must allow the following actions:

  • cloudwatch:ListMetrics: Used to list the available metrics.
  • cloudwatch:GetMetricStatistics: Used to fetch data points for the listed metrics.

These actions, and those listed below, are covered by the wildcard entries such as List* and Get* in the policy document above. If you need a strict policy, specify the full action names listed in this document, and consult the Amazon API documentation for the relevant services.
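
To see how a wildcard entry such as cloudwatch:List* covers a specific action, the sketch below checks required actions against a list of allowed patterns. It is a simplification under stated assumptions: it only considers the Action list of a single Allow statement (no Deny statements, resources, or conditions), it treats action names as case-insensitive, and it uses Python's fnmatch, whose * and ? wildcards behave like IAM's for ordinary action names. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# The two actions the text names as the minimum for CloudWatch collection.
MINIMUM_ACTIONS = ["cloudwatch:ListMetrics", "cloudwatch:GetMetricStatistics"]


def is_covered(action: str, allowed_patterns: Iterable[str]) -> bool:
    """Return True if any allowed pattern matches the action (case-insensitive)."""
    lowered = action.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in allowed_patterns)


def missing_actions(required: Iterable[str], allowed_patterns: Iterable[str]) -> List[str]:
    """Return the required actions that no allowed pattern covers."""
    patterns = list(allowed_patterns)
    return [action for action in required if not is_covered(action, patterns)]


if __name__ == "__main__":
    allowed = ["cloudwatch:Describe*", "cloudwatch:Get*", "cloudwatch:List*"]
    print(missing_actions(MINIMUM_ACTIONS, allowed))
```

With the three cloudwatch entries from the policy document, nothing is reported as missing. Removing cloudwatch:List* from the allowed list would report cloudwatch:ListMetrics.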

Allowing Datadog to read the additional endpoints listed below lets the AWS integration add various tags to CloudWatch metrics and collect more detailed metrics.

Autoscaling

  • autoscaling:DescribeAutoScalingGroups: Used to list all autoscaling groups.
  • autoscaling:DescribePolicies: List available policies (for autocompletion in events and monitors).
  • autoscaling:DescribeTags: Used to list tags for a given autoscaling group. This will add ASG custom tags on ASG CloudWatch metrics.
  • autoscaling:DescribeScalingActivities: Used to generate events when an ASG scales up or down.
  • autoscaling:ExecutePolicy: Execute one policy (scale up or down from a monitor or the events feed).
    This is not included in the installation Policy Document and should only be included if you are using monitors or events to execute an autoscaling policy.

For more information on Autoscaling policies, review the documentation on the AWS website.

Billing

  • budgets:ViewBudget: Used to view budget metrics

For more information on Budget policies, review the documentation on the AWS website.

CloudTrail

  • cloudtrail:DescribeTrails: Used to list trails and find the S3 bucket in which they are stored
  • cloudtrail:GetTrailStatus: Used to skip inactive trails

For more information on CloudTrail policies, review the documentation on the AWS website.

CloudTrail also requires some S3 permissions to access the trails. These are required on the CloudTrail bucket only:

  • s3:ListBucket: List objects in the CloudTrail bucket to get available trails
  • s3:GetBucketLocation: Get bucket’s region to download trails
  • s3:GetObject: Fetch available trails

For more information on S3 policies, review the documentation on the AWS website.
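
The three bucket-scoped S3 permissions listed above can be written as their own policy statements. The sketch below assumes the standard S3 ARN forms in the aws partition (bucket-level actions on the bucket ARN, object-level actions on the bucket ARN followed by /*). The function name and the bucket name are placeholders.

```python
import json
from typing import List


def cloudtrail_bucket_statements(bucket: str) -> List[dict]:
    """Build Allow statements limited to a single CloudTrail bucket."""
    if not bucket:
        raise ValueError("bucket must not be empty")
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return [
        {
            # Bucket-level actions apply to the bucket itself.
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": bucket_arn,
        },
        {
            # Object-level actions apply to the objects inside the bucket.
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"{bucket_arn}/*",
        },
    ]


if __name__ == "__main__":
    # "my-cloudtrail-bucket" is a hypothetical bucket name.
    print(json.dumps(cloudtrail_bucket_statements("my-cloudtrail-bucket"), indent=2))
```

Keeping these statements separate from the "Resource": "*" statement in the main policy document is what restricts them to the CloudTrail bucket.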

DynamoDB

  • dynamodb:ListTables: Used to list available DynamoDB tables.
  • dynamodb:DescribeTable: Used to add metrics on a table size and item count.
  • dynamodb:ListTagsOfResource: Used to collect all tags on a DynamoDB resource.

For more information on DynamoDB policies, review the documentation on the AWS website.

EC2

  • ec2:DescribeInstanceStatus: Used by the ELB integration to assert the health of an instance. Used by the EC2 integration to describe the health of all instances.
  • ec2:DescribeSecurityGroups: Adds SecurityGroup names and custom tags to EC2 instances.
  • ec2:DescribeInstances: Adds tags to EC2 instances and EC2 CloudWatch metrics.

For more information on EC2 policies, review the documentation on the AWS website.

ECS

  • ecs:ListClusters: List available clusters.
  • ecs:ListContainerInstances: List instances of a cluster.
  • ecs:DescribeContainerInstances: Describe instances to add metrics on resources and running tasks; also adds a cluster tag to EC2 instances.

For more information on ECS policies, review the documentation on the AWS website.

ElastiCache

  • elasticache:DescribeCacheClusters: List and describe Cache clusters, to add tags and additional metrics.
  • elasticache:ListTagsForResource: List custom tags of a cluster, to add custom tags.
  • elasticache:DescribeEvents: Add events about snapshots and maintenance.

For more information on ElastiCache policies, review the documentation on the AWS website.

EFS

  • elasticfilesystem:DescribeTags: Gets custom tags applied to file systems
  • elasticfilesystem:DescribeFileSystems: Provides a list of active file systems

For more information on EFS policies, review the documentation on the AWS website.

ELB

  • elasticloadbalancing:DescribeLoadBalancers: List ELBs, add additional tags and metrics.
  • elasticloadbalancing:DescribeTags: Add custom ELB tags to ELB metrics.
  • elasticloadbalancing:DescribeInstanceHealth: Add state of your instances.

For more information on ELB policies, review the documentation on the AWS website.

EMR

  • elasticmapreduce:ListClusters: List available clusters.
  • elasticmapreduce:DescribeCluster: Add tags to CloudWatch EMR metrics.

For more information on EMR policies, review the documentation on the AWS website.

ES

  • es:ListTags: Add custom ES domain tags to ES metrics
  • es:ListDomainNames: Add custom ES domain tags to ES metrics
  • es:DescribeElasticsearchDomains: Add custom ES domain tags to ES metrics

For more information on ES policies, review the documentation on the AWS website.

Kinesis

  • kinesis:ListStreams: List available streams.
  • kinesis:DescribeStream: Add tags and new metrics for Kinesis streams.
  • kinesis:ListTagsForStream: Add custom tags.

For more information on Kinesis policies, review the documentation on the AWS website.

CloudWatch Logs and Lambda

  • logs:DescribeLogGroups: List available groups.
  • logs:DescribeLogStreams: List available streams for a group.
  • logs:FilterLogEvents: Fetch some specific log events for a stream to generate metrics.

For more information on CloudWatch Logs policies, review the documentation on the AWS website.

RDS

  • rds:DescribeDBInstances: Describe RDS instances to add tags.
  • rds:ListTagsForResource: Add custom tags on RDS instances.
  • rds:DescribeEvents: Add events related to RDS databases.

For more information on RDS policies, review the documentation on the AWS website.

Route53

  • route53:listHealthChecks: List available health checks.
  • route53:listTagsForResources: Add custom tags on Route53 CloudWatch metrics.

For more information on Route53 policies, review the documentation on the AWS website.

S3

  • s3:ListAllMyBuckets: Used to list available buckets
  • s3:GetBucketTagging: Used to get custom bucket tags

For more information on S3 policies, review the documentation on the AWS website.

SES

  • ses:GetSendQuota: Add metrics about send quotas.
  • ses:GetSendStatistics: Add metrics about send statistics.

For more information on SES policies, review the documentation on the AWS website.

SNS

  • sns:ListTopics: Used to list available topics.
  • sns:Publish: Used to publish notifications (monitors or event feed).

For more information on SNS policies, review the documentation on the AWS website.

SQS

  • sqs:ListQueues: Used to list alive queues.

For more information on SQS policies, review the documentation on the AWS website.

Support

  • support:*: Used to add metrics about service limits.
    It requires full access because of AWS limitations

Tag

  • tag:getResources: Used to get custom tags by resource type.
  • tag:getTagKeys: Used to get tag keys by region within an AWS account.
  • tag:getTagValues: Used to get tag values by region within an AWS account.

The main use of the Resource Groups Tagging API is to reduce the number of API calls we need to collect custom tags. For more information on Tag policies, review the documentation on the AWS website.

Troubleshooting

Do you see a discrepancy between your data in CloudWatch and Datadog?

There are two important distinctions to be aware of:

  1. In AWS for counters, a graph that is set to ‘sum’ ‘1minute’ shows the total number of occurrences in one minute leading up to that point, i.e. the rate per 1 minute. Datadog is displaying the raw data from AWS normalized to per second values, regardless of the timeframe selected in AWS, which is why you will probably see our value as lower.
  2. Overall, min/max/avg have a different meaning within AWS than in Datadog. In AWS, average latency, minimum latency, and maximum latency are three distinct metrics that AWS collects. When Datadog pulls metrics from AWS CloudWatch, we only get the average latency as a single timeseries per ELB. Within Datadog, when you are selecting ‘min’, ‘max’, or ‘avg’, you are controlling how multiple timeseries will be combined. For example, requesting system.cpu.idle without any filter would return one series for each host that reports that metric and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.
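
To make the two points above concrete, here is a small sketch with made-up numbers: a one-minute CloudWatch sum converted to a per-second value, and several timeseries combined with avg or max. The function names are illustrative only and are not part of any Datadog or AWS API.

```python
from statistics import mean
from typing import List, Sequence


def per_second(total: float, period_seconds: int = 60) -> float:
    """Convert a sum collected over a period into a per-second rate."""
    return total / period_seconds


def combine(series: Sequence[Sequence[float]], how: str) -> List[float]:
    """Combine several equal-length timeseries point by point.

    'how' selects the aggregator across series: 'min', 'max', or 'avg'.
    """
    reducers = {"min": min, "max": max, "avg": mean}
    reducer = reducers[how]
    return [reducer(points) for points in zip(*series)]


if __name__ == "__main__":
    # 120 requests counted in one minute appear as 2 requests per second.
    print(per_second(120))

    # Two hosts reporting the same metric: avg and max differ.
    two_hosts = [[90, 80], [70, 100]]
    print(combine(two_hosts, "avg"), combine(two_hosts, "max"))

    # A single host: no aggregation is needed, so avg and max agree.
    one_host = [[90, 80]]
    print(combine(one_host, "avg") == combine(one_host, "max"))
```

The first call shows why a counter graphed as a one-minute sum in CloudWatch looks larger than the same counter in Datadog. The last comparison shows that the aggregator only matters when more than one timeseries is being combined.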

Metrics delayed?

When using the AWS integration, we’re pulling in metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.

To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. Metrics are made available by AWS dependent on the account level. For example, if you are paying for “detailed metrics” within AWS, they are available more quickly. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.

On the Datadog side, we do have the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact support@datadoghq.com for more info on this.

To obtain metrics with virtually zero delay, we recommend installing the Datadog Agent on those hosts. We’ve written a bit about this here, especially in relation to CloudWatch.

Missing metrics?

The CloudWatch API returns only metrics that have datapoints. If, for instance, an ELB has no attached instances, you should expect to see no metrics for that ELB in Datadog.

Wrong count of aws.elb.healthy_host_count?

When the cross-zone load balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch’s side), so if you have 2 instances in 1a and 3 in 1b, the metric will display 5 instances per availability zone. As this can be counterintuitive, we’ve added new metrics, aws.elb.healthy_host_count_deduped and aws.elb.un_healthy_host_count_deduped, that display the count of healthy and unhealthy instances per availability zone, regardless of whether the cross-zone load balancing option is enabled.
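
The difference between the two counts can be modeled in a few lines. This sketch only reproduces the behavior described above with the same 2-and-3 example; it is not how Datadog computes the deduped metrics, and the function name is hypothetical.

```python
from typing import Dict


def reported_healthy_hosts(healthy_by_az: Dict[str, int], cross_zone: bool) -> Dict[str, int]:
    """Model the per-zone healthy host count as reported for an ELB.

    With cross-zone load balancing enabled, every instance is counted in
    every availability zone, so each zone shows the overall total.
    """
    if not cross_zone:
        return dict(healthy_by_az)
    total = sum(healthy_by_az.values())
    return {az: total for az in healthy_by_az}


if __name__ == "__main__":
    actual = {"1a": 2, "1b": 3}
    print(reported_healthy_hosts(actual, cross_zone=True))
    print(reported_healthy_hosts(actual, cross_zone=False))
```

In this model the cross-zone result corresponds to aws.elb.healthy_host_count, while the actual per-zone numbers are what the deduped metric is meant to show.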

Duplicated hosts when installing the Agent?

When installing the Agent on an AWS host, you might see duplicated hosts on the infrastructure page for a few hours if you manually set the hostname in the Agent’s configuration. The second host will disappear a few hours later and won’t affect your billing.