Datadog-AWS Integration

Overview

Connect to Amazon Web Services (AWS) in order to:

  • See automatic AWS status updates in your stream
  • Get CloudWatch metrics for EC2 hosts without installing the Agent
  • Tag your EC2 hosts with EC2-specific information (e.g. availability zone)
  • See EC2 scheduled maintenance events in your stream
  • Collect CloudWatch metrics and events from many other AWS products

Related integrations include:

  • API Gateway: create, publish, maintain, and secure APIs
  • Autoscaling: scale EC2 capacity
  • Billing: billing and budgets
  • CloudFront: global content delivery network
  • CloudTrail: access to log files and AWS API calls
  • CloudSearch: managed search service in the cloud
  • Direct Connect: dedicated network connection to AWS
  • DynamoDB: NoSQL database
  • EC2 Container Service (ECS): container management service that supports Docker containers
  • Elastic Beanstalk: easy-to-use service for deploying and scaling web applications and services
  • Elastic Block Store (EBS): persistent block-level storage volumes
  • ElastiCache: in-memory cache in the cloud
  • Elastic Compute Cloud (EC2): resizable compute capacity in the cloud
  • Elastic File System (EFS): shared file storage
  • Elastic Load Balancing (ELB): distributes incoming application traffic across multiple Amazon EC2 instances
  • Elastic MapReduce (EMR): data processing using Hadoop
  • Elasticsearch Service (ES): deploy, operate, and scale Elasticsearch clusters
  • Firehose: capture and load streaming data
  • IoT: connect IoT devices with cloud services
  • Kinesis: service for real-time processing of large, distributed data streams
  • Key Management Service (KMS): create and control encryption keys
  • Lambda: serverless computing
  • Machine Learning (ML): create machine learning models
  • OpsWorks: configuration management
  • Polly: text-to-speech service
  • Redshift: data warehouse solution
  • Relational Database Service (RDS): relational database in the cloud
  • Route 53: DNS and traffic management with availability monitoring
  • Simple Email Service (SES): cost-effective, outbound-only email-sending service
  • Simple Notification Service (SNS): alerts and notifications
  • Simple Queue Service (SQS): messaging queue service
  • Simple Storage Service (S3): highly available and scalable cloud storage service
  • Simple Workflow Service (SWF): cloud workflow management
  • Storage Gateway: hybrid cloud storage
  • Web Application Firewall (WAF): protect web applications from common web exploits
  • WorkSpaces: secure desktop computing service

Setup

Installation

Setting up the Datadog integration with Amazon Web Services requires configuring role delegation using AWS IAM. To get a better understanding of role delegation, refer to the AWS IAM Best Practices guide.

The GovCloud and China regions do not currently support IAM role delegation. If you are deploying in these regions, skip to the Configuration for China and GovCloud section below.
  1. Create a new role in the AWS IAM Console.
  2. Select Another AWS account for the Role Type.
  3. For Account ID, enter 464622532012 (Datadog’s account ID). This means that you will grant Datadog read-only access to your AWS data.
  4. Check Require external ID and enter the one generated in the Datadog app. Make sure you leave Require MFA disabled. For more information about the External ID, refer to this document in the IAM User Guide.
  5. Click Next: Permissions.
  6. Click Create Policy. Note: if you’ve already created the policy, search for it on this page and select it. Otherwise, complete the following steps to create a new one.
  7. Choose Create Your Own Policy.
  8. Name the policy DatadogAWSIntegrationPolicy, or one of your choosing and provide an apt description. To take advantage of every AWS integration offered by Datadog, use the following in the Policy Document textbox. As we add other components to the integration, these permissions may change.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": [
            "autoscaling:Describe*",
            "budgets:ViewBudget",
            "cloudtrail:DescribeTrails",
            "cloudtrail:GetTrailStatus",
            "cloudwatch:Describe*",
            "cloudwatch:Get*",
            "cloudwatch:List*",
            "codedeploy:List*",
            "codedeploy:BatchGet*",
            "directconnect:Describe*",
            "dynamodb:List*",
            "dynamodb:Describe*",
            "ec2:Describe*",
            "ec2:Get*",
            "ecs:Describe*",
            "ecs:List*",
            "elasticache:Describe*",
            "elasticache:List*",
            "elasticfilesystem:DescribeFileSystems",
            "elasticfilesystem:DescribeTags",
            "elasticloadbalancing:Describe*",
            "elasticmapreduce:List*",
            "elasticmapreduce:Describe*",
            "es:ListTags",
            "es:ListDomainNames",
            "es:DescribeElasticsearchDomains",
            "kinesis:List*",
            "kinesis:Describe*",
            "lambda:List*",
            "logs:Get*",
            "logs:Describe*",
            "logs:FilterLogEvents",
            "logs:TestMetricFilter",
            "rds:Describe*",
            "rds:List*",
            "route53:List*",
            "s3:GetBucketTagging",
            "s3:ListAllMyBuckets",
            "ses:Get*",
            "sns:List*",
            "sns:Publish",
            "sqs:ListQueues",
            "support:*",
            "tag:getResources",
            "tag:getTagKeys",
            "tag:getTagValues"
          ],
          "Effect": "Allow",
          "Resource": "*"
        }
      ]
    }
    If you are not comfortable with granting all of these permissions, at the very least use the existing policies named AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For more detailed information regarding permissions, please see the Permissions section below.
  9. Click Next: Review.
  10. Give the role a name such as DatadogAWSIntegrationRole and an apt description, then click Create Role.
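
For reference, the console choices above (another AWS account, Datadog’s account ID, a required external ID) correspond to a role trust policy along the following lines. This is a minimal Python sketch that only builds the JSON document; the function name build_trust_policy and the sample external ID are illustrative assumptions, and the document the console generates for you may differ in detail.

```python
import json

DATADOG_ACCOUNT_ID = "464622532012"  # Datadog's account ID, from step 3


def build_trust_policy(external_id: str) -> dict:
    """Return a trust policy that lets Datadog's account assume the role.

    `external_id` is the value generated in the Datadog app (step 4).
    """
    if not external_id:
        raise ValueError("an external ID is required")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{DATADOG_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                # Only requests presenting the matching external ID are allowed.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # "example-external-id" is a placeholder, not a real value.
    print(json.dumps(build_trust_policy("example-external-id"), indent=2))
```

Comparing this output with the Trust relationships tab of the role you created is a quick way to confirm that the account ID and external ID were entered correctly.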

Configuration

  1. Open the AWS Integration tile.
  2. Select the Role Delegation tab.
  3. Enter your AWS Account ID without dashes, e.g. 123456789012, not 1234-5678-9012. Your Account ID can be found in the ARN of the newly created role. Then enter the name of the role you just created. Finally enter the External ID you specified above.
  4. Choose the services you want to collect metrics for on the left side of the dialog. You can optionally add tags to all hosts and metrics. To monitor only a subset of your EC2 instances, tag them and specify the tag in the limit textbox here.
  5. Click Install Integration.
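
Step 3 expects the Account ID without dashes and notes that it can be read from the role’s ARN. The following Python sketch shows one way to normalize an ID and to split a role ARN into the two values the tile asks for. The helper names are hypothetical, and the ARN in the usage note is a made-up example.

```python
import re

# IAM role ARN: arn:<partition>:iam::<12-digit account>:role/<name>
_ROLE_ARN = re.compile(r"^arn:(aws|aws-cn|aws-us-gov):iam::(\d{12}):role/(.+)$")


def normalize_account_id(raw: str) -> str:
    """Strip dashes and whitespace, then require exactly 12 digits."""
    digits = re.sub(r"[\s-]", "", raw)
    if not re.fullmatch(r"\d{12}", digits):
        raise ValueError(f"not a 12-digit AWS account ID: {raw!r}")
    return digits


def split_role_arn(arn: str) -> tuple[str, str]:
    """Return (account_id, role_name) from an IAM role ARN.

    If the role was created with a path, the path stays in the name part.
    """
    match = _ROLE_ARN.match(arn)
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    return match.group(2), match.group(3)
```

For example, normalize_account_id("1234-5678-9012") returns "123456789012", and split_role_arn("arn:aws:iam::123456789012:role/DatadogAWSIntegrationRole") returns ("123456789012", "DatadogAWSIntegrationRole").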

Configuration for China and GovCloud

  1. Open the AWS Integration tile.
  2. Select the Access Keys (GovCloud or China Only) tab.
  3. Enter your AWS Access Key and AWS Secret Key. Only access and secret keys for China and GovCloud are accepted.
  4. Choose the services you want to collect metrics for on the left side of the dialog. You can optionally add tags to all hosts and metrics. To monitor only a subset of your EC2 instances, tag them and specify the tag in the limit textbox here.
  5. Click Install Integration.

Data Collected

Metrics

aws.logs.incoming_bytes
(gauge)
The volume of log events in uncompressed bytes uploaded to Cloudwatch Logs.
shown as byte
aws.logs.incoming_log_events
(count)
The number of log events uploaded to Cloudwatch Logs.
shown as event
aws.logs.forwarded_bytes
(gauge)
The volume of log events in compressed bytes forwarded to the subscription destination.
shown as byte
aws.logs.forwarded_log_events
(count)
The number of log events forwarded to the subscription destination.
shown as event
aws.logs.delivery_errors
(count)
The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination.
shown as event
aws.logs.delivery_throttling
(count)
The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination.
shown as event
aws.ec2spot.available_instance_pools_count
(count)
The Spot Instance pools specified in the Spot Fleet request.
shown as instance
aws.ec2spot.bids_submitted_for_capacity
(count)
The capacity for which Amazon EC2 has submitted bids.
shown as instance
aws.ec2spot.eligible_instance_pool_count
(count)
The Spot Instance pools specified in the Spot Fleet request where Amazon EC2 can fulfill bids.
shown as instance
aws.ec2spot.fulfilled_capacity
(count)
The capacity that Amazon EC2 has fulfilled.
shown as instance
aws.ec2spot.max_percent_capacity_allocation
(gauge)
The maximum value of PercentCapacityAllocation across all Spot Instance pools specified in the Spot Fleet request.
shown as percent
aws.ec2spot.pending_capacity
(count)
The difference between TargetCapacity and FulfilledCapacity.
shown as instance
aws.ec2spot.percent_capacity_allocation
(gauge)
The capacity allocated for the Spot Instance pool for the specified dimensions.
shown as percent
aws.ec2spot.target_capacity
(count)
The target capacity of the Spot Fleet request.
shown as instance
aws.ec2spot.terminating_capacity
(count)
The capacity that is being terminated due to Spot Instance interruptions.
shown as instance
aws.dms.cpuutilization
(gauge)
Average percentage of allocated EC2 compute units that are currently in use on the instance.
aws.dms.free_storage_space
(gauge)
The amount of available storage space
shown as byte
aws.dms.freeable_memory
(gauge)
The amount of available random access memory.
shown as byte
aws.dms.write_iops
(gauge)
The average number of disk write I/O operations per second.
shown as operation
aws.dms.read_iops
(gauge)
The average number of disk read I/O operations per second.
shown as operation
aws.dms.write_throughput
(gauge)
The average number of bytes written to disk per second.
shown as byte
aws.dms.read_throughput
(gauge)
The average number of bytes read from disk per second.
shown as byte
aws.dms.write_latency
(gauge)
The average amount of time taken per write disk I/O operation
shown as second
aws.dms.read_latency
(gauge)
The average amount of time taken per read disk I/O operation
shown as second
aws.dms.swap_usage
(gauge)
The amount of swap space used on the DB Instance
shown as byte
aws.dms.network_transmit_throughput
(gauge)
The outgoing (Transmit) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication
shown as byte
aws.dms.network_receive_throughput
(gauge)
The incoming (Receive) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication.
shown as byte
aws.dms.full_load_throughput_bandwidth_source
(gauge)
Incoming network bandwidth from a full load from the source
shown as kibibyte
aws.dms.full_load_throughput_bandwidth_target
(gauge)
Outgoing network bandwidth from a full load for the target
shown as kibibyte
aws.dms.full_load_throughput_rows_source
(gauge)
Incoming changes from a full load from the source in rows per second
shown as row
aws.dms.full_load_throughput_rows_target
(gauge)
Outgoing changes from a full load for the target
shown as row
aws.dms.cdcincoming_changes
(gauge)
Total row count of changes for the task
shown as row
aws.dms.cdcchanges_memory_source
(gauge)
Amount of rows accumulating in a memory and waiting to be committed from the source
shown as row
aws.dms.cdcchanges_memory_target
(gauge)
Amount of rows accumulating in a memory and waiting to be committed to the target
shown as row
aws.dms.cdcchanges_disk_source
(gauge)
Amount of rows accumulating on disk and waiting to be committed from the source
shown as row
aws.dms.cdcchanges_disk_target
(gauge)
Amount of rows accumulating on disk and waiting to be committed to the target
shown as row
aws.dms.cdcthroughput_bandwidth_source
(gauge)
Incoming task network bandwidth from the source
shown as kibibyte
aws.dms.cdcthroughput_bandwidth_target
(gauge)
Outgoing task network bandwidth for the target
shown as kibibyte
aws.dms.cdcthroughput_rows_source
(gauge)
Incoming task changes from the source
shown as row
aws.dms.cdcthroughput_rows_target
(gauge)
Outgoing task changes for the target
shown as row
aws.dms.cdclatency_source
(gauge)
Latency reading from source
shown as second
aws.dms.cdclatency_target
(gauge)
Latency writing to the target
shown as second
aws.events.invocations
(count)
Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations but does not include throttled or retried attempts until they fail permanently.
aws.events.failed_invocations
(count)
Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt.
aws.events.triggered_rules
(count)
Measures the number of triggered rules that matched with any event.
aws.events.matched_events
(count)
Measures the number of events that matched with any rule.
aws.events.throttled_rules
(count)
Measures the number of triggered rules that are being throttled.
aws.natgateway.active_connection_count
(count)
The count of concurrent active TCP connections through the NAT gateway.
shown as connection
aws.natgateway.bytes_in_from_destination
(count)
The number of bytes received by the NAT Gateway from the destination.
shown as byte
aws.natgateway.bytes_in_from_source
(count)
The number of bytes received by the NAT Gateway from the VPC clients.
shown as byte
aws.natgateway.bytes_out_to_destination
(count)
The number of bytes sent through the NAT Gateway to the destination.
shown as byte
aws.natgateway.bytes_out_to_source
(count)
The number of bytes sent through the NAT Gateway to the VPC clients.
shown as byte
aws.natgateway.connection_attempt_count
(count)
The count of connections attempted through the NAT Gateway.
shown as attempt
aws.natgateway.connection_established_count
(count)
The count of connections established through the NAT Gateway.
shown as connection
aws.natgateway.error_port_allocation
(count)
The count of times a source port could not be allocated by the NAT Gateway.
shown as error
aws.natgateway.idle_timeout_count
(count)
The count of timeouts caused by connections going from active to idle state.
shown as timeout
aws.natgateway.packets_drop_count
(count)
The count of packets dropped by the NAT Gateway.
shown as packet
aws.natgateway.packets_in_from_destination
(count)
The number of packets received by the NAT Gateway from the destination.
shown as packet
aws.natgateway.packets_in_from_source
(count)
The number of packets received by the NAT Gateway from the VPC clients.
shown as packet
aws.natgateway.packets_out_to_destination
(count)
The number of packets sent through the NAT Gateway to the destination.
shown as packet
aws.natgateway.packets_out_to_source
(count)
The number of packets sent through the NAT Gateway to the VPC clients.
shown as packet
aws.states.execution_time
(gauge)
The average time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.executions_aborted
(count)
The number of executions that were aborted/terminated.
aws.states.execution_throttled
(count)
The number of StateEntered events and retries that have been throttled.
aws.states.executions_failed
(count)
The number of executions that failed.
aws.states.executions_started
(count)
The number of executions started.
aws.states.executions_succeeded
(count)
The number of executions that completed successfully.
aws.states.executions_timed_out
(count)
The number of executions that timed out for any reason.
aws.states.lambda_function_run_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_schedule_time
(gauge)
The average time interval, in milliseconds, that the lambda function stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the lambda function stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the lambda function stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_functions_failed
(count)
The number of lambda functions that failed.
aws.states.lambda_functions_heartbeat_timed_out
(count)
The number of lambda functions that were timed out due to a heartbeat timeout.
aws.states.lambda_functions_scheduled
(count)
The number of lambda functions that were scheduled.
aws.states.lambda_functions_started
(count)
The number of lambda functions that were started.
aws.states.lambda_functions_succeeded
(count)
The number of lambda functions that completed successfully.
aws.states.lambda_functions_timed_out
(count)
The number of lambda functions that were timed out on close.
aws.states.activity_run_time
(gauge)
The average time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_schedule_time
(gauge)
The average time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_time
(gauge)
The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activities_failed
(count)
The number of activities that failed.
aws.states.activities_heartbeat_timed_out
(count)
The number of activities that were timed out due to a heartbeat timeout.
aws.states.activities_scheduled
(count)
The number of activities that were scheduled.
aws.states.activities_started
(count)
The number of activities that were started.
aws.states.activities_succeeded
(count)
The number of activities that completed successfully.
aws.states.activities_timed_out
(count)
The number of activities that were timed out on close.
aws.vpn.tunnel_data_in
(count)
The number of bytes that have come in through the VPN tunnel
shown as byte
aws.vpn.tunnel_data_out
(count)
The number of bytes that have gone out through the VPN tunnel
shown as byte
aws.vpn.tunnel_state
(gauge)
This metric is 1 when the VPN tunnel is up and 0 when it is down.

Permissions

The core Datadog-AWS integration pulls data from AWS CloudWatch. At a minimum, your Policy Document will need to allow the following actions:

  • cloudwatch:ListMetrics to list the available CloudWatch metrics.
  • cloudwatch:GetMetricStatistics to fetch data points for a given metric.
These actions and the ones listed below are included in the Policy Document using wildcards such as List* and Get*. If you require strict policies, use the complete action names as listed and refer to the Amazon API documentation for the services you require.
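
If you replace the wildcards with strict action names, it can help to check that every action you need is still covered by the policy. The Python sketch below does this with the standard-library fnmatch module, which approximates IAM wildcard matching; it is not IAM’s own evaluation logic. It compares in lowercase because IAM action names are case-insensitive. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def is_covered(action: str, patterns: list[str]) -> bool:
    """True if `action` matches at least one policy pattern.

    Both sides are lowercased first, since IAM action names are
    case-insensitive (tag:getResources and tag:GetResources are equivalent).
    """
    action = action.lower()
    return any(fnmatchcase(action, pattern.lower()) for pattern in patterns)


def uncovered(required: list[str], patterns: list[str]) -> list[str]:
    """Return the required actions that no pattern allows."""
    return [action for action in required if not is_covered(action, patterns)]


if __name__ == "__main__":
    policy_actions = ["cloudwatch:Describe*", "cloudwatch:Get*", "cloudwatch:List*"]
    required = ["cloudwatch:ListMetrics", "cloudwatch:GetMetricStatistics"]
    print(uncovered(required, policy_actions))  # prints []
```

With the installation Policy Document, uncovered(["autoscaling:ExecutePolicy"], ["autoscaling:Describe*"]) returns ["autoscaling:ExecutePolicy"], which matches the note below that this action is not included by default.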

By allowing Datadog to read the following additional endpoints, the AWS integration will be able to add tags to CloudWatch metrics and generate additional metrics.

Autoscaling

  • autoscaling:DescribeAutoScalingGroups: Used to list all autoscaling groups.
  • autoscaling:DescribePolicies: List available policies (for autocompletion in events and monitors).
  • autoscaling:DescribeTags: Used to list tags for a given autoscaling group. This will add ASG custom tags on ASG CloudWatch metrics.
  • autoscaling:DescribeScalingActivities: Used to generate events when an ASG scales up or down.
  • autoscaling:ExecutePolicy: Execute one policy (scale up or down from a monitor or the events feed).
    This is not included in the installation Policy Document and should only be included if you are using monitors or events to execute an autoscaling policy.

For more information on Autoscaling policies, review the documentation on the AWS website.

Billing

  • budgets:ViewBudget: Used to view budget metrics

For more information on Budget policies, review the documentation on the AWS website.

CloudTrail

  • cloudtrail:DescribeTrails: Used to list trails and find the S3 bucket in which they are stored
  • cloudtrail:GetTrailStatus: Used to skip inactive trails

For more information on CloudTrail policies, review the documentation on the AWS website.

CloudTrail also requires some S3 permissions to access the trails. These are required on the CloudTrail bucket only.

  • s3:ListBucket: List objects in the CloudTrail bucket to get available trails
  • s3:GetBucketLocation: Get bucket’s region to download trails
  • s3:GetObject: Fetch available trails

For more information on S3 policies, review the documentation on the AWS website.
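
Because these three permissions are needed on the CloudTrail bucket only, they can be scoped to that bucket rather than granted on every resource. The Python sketch below builds two such statements; the bucket name is a placeholder and the function name is hypothetical. It assumes the standard aws partition; ARNs in GovCloud and China use a different partition prefix.

```python
def cloudtrail_bucket_statements(bucket: str) -> list[dict]:
    """Policy statements limited to one CloudTrail bucket.

    s3:ListBucket and s3:GetBucketLocation act on the bucket itself,
    while s3:GetObject acts on the objects inside it, hence two ARNs.
    """
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": bucket_arn,
        },
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"{bucket_arn}/*",
        },
    ]
```

The returned statements can be appended to the Statement list of the Policy Document, next to the existing statement that uses "Resource": "*".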

Direct Connect

  • directconnect:DescribeConnections: Used to list available Direct Connect connections.
  • directconnect:DescribeTags: Used to gather custom tags applied to Direct Connect connections.

For more information on Direct Connect policies, review the documentation on the AWS website.

DynamoDB

  • dynamodb:ListTables: Used to list available DynamoDB tables.
  • dynamodb:DescribeTable: Used to add metrics on a table size and item count.
  • dynamodb:ListTagsOfResource: Used to collect all tags on a DynamoDB resource.

For more information on DynamoDB policies, review the documentation on the AWS website.

EC2

  • ec2:DescribeInstanceStatus: Used by the ELB integration to assert the health of an instance. Used by the EC2 integration to describe the health of all instances.
  • ec2:DescribeSecurityGroups: Adds SecurityGroup names and custom tags to EC2 instances.
  • ec2:DescribeInstances: Adds tags to EC2 instances and EC2 CloudWatch metrics.

For more information on EC2 policies, review the documentation on the AWS website.

ECS

  • ecs:ListClusters: List available clusters.
  • ecs:ListContainerInstances: List instances of a cluster.
  • ecs:DescribeContainerInstances: Describe instances to add metrics on resources and running tasks, and to add the cluster tag to EC2 instances.

For more information on ECS policies, review the documentation on the AWS website.

ElastiCache

  • elasticache:DescribeCacheClusters: List and describe Cache clusters, to add tags and additional metrics.
  • elasticache:ListTagsForResource: List custom tags of a cluster, to add custom tags.
  • elasticache:DescribeEvents: Add events about snapshots and maintenance.

For more information on ElastiCache policies, review the documentation on the AWS website.

EFS

  • elasticfilesystem:DescribeTags: Gets custom tags applied to file systems
  • elasticfilesystem:DescribeFileSystems: Provides a list of active file systems

For more information on EFS policies, review the documentation on the AWS website.

ELB

  • elasticloadbalancing:DescribeLoadBalancers: List ELBs, add additional tags and metrics.
  • elasticloadbalancing:DescribeTags: Add custom ELB tags to ELB metrics.
  • elasticloadbalancing:DescribeInstanceHealth: Add state of your instances.

For more information on ELB policies, review the documentation on the AWS website.

EMR

  • elasticmapreduce:ListClusters: List available clusters.
  • elasticmapreduce:DescribeCluster: Add tags to CloudWatch EMR metrics.

For more information on EMR policies, review the documentation on the AWS website.

ES

  • es:ListTags: Add custom ES domain tags to ES metrics
  • es:ListDomainNames: Add custom ES domain tags to ES metrics
  • es:DescribeElasticsearchDomains: Add custom ES domain tags to ES metrics

For more information on ES policies, review the documentation on the AWS website.

Kinesis

  • kinesis:ListStreams: List available streams.
  • kinesis:DescribeStream: Add tags and new metrics for kinesis streams.
  • kinesis:ListTagsForStream: Add custom tags.

For more information on Kinesis policies, review the documentation on the AWS website.

CloudWatch Logs and Lambda

  • logs:DescribeLogGroups: List available groups.
  • logs:DescribeLogStreams: List available streams for a group.
  • logs:FilterLogEvents: Fetch some specific log events for a stream to generate metrics.

For more information on CloudWatch Logs policies, review the documentation on the AWS website.

RDS

  • rds:DescribeDBInstances: Describe RDS instances to add tags.
  • rds:ListTagsForResource: Add custom tags on RDS instances.
  • rds:DescribeEvents: Add events related to RDS databases.

For more information on RDS policies, review the documentation on the AWS website.

Route53

  • route53:listHealthChecks: List available health checks.
  • route53:listTagsForResources: Add custom tags on Route53 CloudWatch metrics.

For more information on Route53 policies, review the documentation on the AWS website.

S3

  • s3:ListAllMyBuckets: Used to list available buckets
  • s3:GetBucketTagging: Used to get custom bucket tags

For more information on S3 policies, review the documentation on the AWS website.

SES

  • ses:GetSendQuota: Add metrics about send quotas.
  • ses:GetSendStatistics: Add metrics about send statistics.

For more information on SES policies, review the documentation on the AWS website.

SNS

  • sns:ListTopics: Used to list available topics.
  • sns:Publish: Used to publish notifications (monitors or event feed).

For more information on SNS policies, review the documentation on the AWS website.

SQS

  • sqs:ListQueues: Used to list alive queues.

For more information on SQS policies, review the documentation on the AWS website.

Support

  • support:*: Used to add metrics about service limits.
    It requires full access because of AWS limitations.

Tag

  • tag:getResources: Used to get custom tags by resource type.
  • tag:getTagKeys: Used to get tag keys by region within an AWS account.
  • tag:getTagValues: Used to get tag values by region within an AWS account.

The main use of the Resource Groups Tagging API is to reduce the number of API calls we need to collect custom tags. For more information on Tag policies, review the documentation on the AWS website.

Troubleshooting

Do you believe you’re seeing a discrepancy between your data in CloudWatch and Datadog?

There are two important distinctions to be aware of:

  1. For counters in AWS, a graph that is set to ‘sum’ and ‘1 minute’ shows the total number of occurrences in the minute leading up to that point, i.e. the rate per minute. Datadog displays the raw data from AWS normalized to per-second values, regardless of the timeframe selected in AWS, which is why the value in Datadog will probably be lower.
  2. Overall, min/max/avg have a different meaning within AWS than in Datadog. In AWS, average latency, minimum latency, and maximum latency are three distinct metrics that AWS collects. When Datadog pulls metrics from AWS CloudWatch, we only get the average latency as a single time series per ELB. Within Datadog, when you are selecting ‘min’, ‘max’, or ‘avg’, you are controlling how multiple time series will be combined. For example, requesting system.cpu.idle without any filter would return one series for each host that reports that metric and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.
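
The two distinctions above can be illustrated with a little arithmetic. This Python sketch uses made-up numbers and hypothetical function names; it only mirrors the behavior described here and is not Datadog’s implementation.

```python
from statistics import mean


def per_second(total_in_period: float, period_seconds: int) -> float:
    """Normalize a CloudWatch 'sum' over a period to a per-second value."""
    return total_in_period / period_seconds


def combine(values: list[float], how: str) -> float:
    """Combine one value per time series into a single value."""
    aggregators = {"min": min, "max": max, "avg": mean}
    return aggregators[how](values)


if __name__ == "__main__":
    # 1. A counter that CloudWatch graphs as 120 (sum over 1 minute)
    #    appears as 2.0 once normalized to per-second values.
    print(per_second(120, 60))  # prints 2.0

    # 2. 'avg' and 'max' differ when several series are combined...
    idle_by_host = [90.0, 70.0, 80.0]
    print(combine(idle_by_host, "avg"), combine(idle_by_host, "max"))  # prints 80.0 90.0

    # ...but give the same result for a single series.
    print(combine([90.0], "avg") == combine([90.0], "max"))  # prints True
```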

Metrics delayed?

When using the AWS integration, we’re pulling in metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.

To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. How quickly AWS makes metrics available depends on the account level. For example, if you are paying for “detailed metrics” within AWS, they are available more quickly. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.
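
To make “metric-by-metric crawl” concrete, the Python sketch below lists the metrics in a namespace and then issues one statistics request per metric, which is why the number of API calls grows with the number of metrics. It is an illustration only, not Datadog’s crawler. The cloudwatch argument is assumed to be a boto3 CloudWatch client (an assumption not made by this page); any object exposing list_metrics and get_metric_statistics with the same keyword arguments works. The function name and defaults are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def crawl_metrics(cloudwatch, namespace, period=60, lookback_minutes=10):
    """Fetch 'Average' datapoints for every metric in a namespace.

    Returns a dict keyed by (metric name, sorted dimension pairs).
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    results = {}
    kwargs = {"Namespace": namespace}
    while True:
        page = cloudwatch.list_metrics(**kwargs)
        for metric in page["Metrics"]:
            dimensions = metric.get("Dimensions", [])
            # One request per metric: this is the part that is rate limited.
            stats = cloudwatch.get_metric_statistics(
                Namespace=metric["Namespace"],
                MetricName=metric["MetricName"],
                Dimensions=dimensions,
                StartTime=start,
                EndTime=end,
                Period=period,
                Statistics=["Average"],
            )
            key = (
                metric["MetricName"],
                tuple(sorted((d["Name"], d["Value"]) for d in dimensions)),
            )
            # Datapoints are not guaranteed to arrive in time order.
            results[key] = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # follow pagination
    return results
```

Only cloudwatch:ListMetrics and cloudwatch:GetMetricStatistics are needed for this loop, which are the two minimum actions named in the Permissions section.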

On the Datadog side, we do have the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact support@datadoghq.com for more info on this.

To obtain metrics with virtually zero delay, we recommend installing the Datadog Agent on those hosts. We’ve written a bit about this here, especially in relation to CloudWatch.

Missing metrics?

CloudWatch’s API returns only metrics with datapoints. If, for instance, an ELB has no attached instances, you should expect to see no metrics for that ELB in Datadog.

Wrong count of aws.elb.healthy_host_count?

When the cross-zone load balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch’s side), so if you have 2 instances in 1a and 3 in 1b, the metric will display 5 instances per availability zone. As this can be counterintuitive, we’ve added new metrics, aws.elb.healthy_host_count_deduped and aws.elb.un_healthy_host_count_deduped, that display the count of healthy and unhealthy instances per availability zone, regardless of whether the cross-zone load balancing option is enabled.
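
The difference between the two counts can be modeled in a few lines. This Python sketch is a toy model of the behavior described above, using the 2-and-3 instance example; it is not how CloudWatch or Datadog computes the metrics, and the function names are hypothetical.

```python
from collections import Counter


def healthy_host_count(instance_zones: list[str], cross_zone: bool) -> dict[str, int]:
    """Per-zone count as CloudWatch reports it.

    With cross-zone load balancing, every zone reports all healthy instances.
    """
    per_zone = Counter(instance_zones)
    if cross_zone:
        total = sum(per_zone.values())
        return {zone: total for zone in per_zone}
    return dict(per_zone)


def healthy_host_count_deduped(instance_zones: list[str]) -> dict[str, int]:
    """Per-zone count of the instances actually located in each zone."""
    return dict(Counter(instance_zones))


if __name__ == "__main__":
    zones = ["1a", "1a", "1b", "1b", "1b"]  # one entry per healthy instance
    print(healthy_host_count(zones, cross_zone=True))  # prints {'1a': 5, '1b': 5}
    print(healthy_host_count_deduped(zones))           # prints {'1a': 2, '1b': 3}
```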

Duplicated hosts when installing the Agent?

When installing the Agent on an AWS host, you might see duplicated hosts on the infra page for a few hours if you manually set the hostname in the Agent’s configuration. This second host will disappear a few hours later, and won’t affect your billing.