AWS

Overview

Connect to Amazon Web Services (AWS) to:

  • See automatic AWS status updates in your stream
  • Get CloudWatch metrics for EC2 hosts without installing the Agent
  • Tag your EC2 hosts with EC2-specific information (e.g. availability zone)
  • See EC2 scheduled maintenance events in your stream
  • Collect CloudWatch metrics and events from many other AWS products

Datadog's Amazon integration is built to collect ALL metrics from CloudWatch. Datadog strives to continually update the docs to show every sub-integration, but cloud services rapidly release new metrics and services, so the list of integrations sometimes lags behind.

| Integration | Description |
|---|---|
| API Gateway | Create, publish, maintain, and secure APIs |
| AppStream | Fully managed application streaming on AWS |
| AppSync | A GraphQL service with real-time data synchronization and offline programming features |
| Athena | Serverless interactive query service |
| Auto Scaling | Scale EC2 capacity |
| Billing | Billing and budgets |
| CloudFront | Local content delivery network |
| CloudHSM | Managed hardware security module (HSM) |
| CloudSearch | Managed search service for websites and applications |
| CloudTrail | Access to log files and AWS API calls |
| CodeBuild | Fully managed build service |
| CodeDeploy | Automate code deployments |
| Cognito | Secure user sign-up and sign-in |
| Connect | A self-service, cloud-based contact center service |
| Direct Connect | Dedicated network connection to AWS |
| DMS | Database Migration Service |
| DocumentDB | MongoDB-compatible database |
| DynamoDB | NoSQL database |
| EBS (Elastic Block Store) | Persistent block-level storage volumes |
| EC2 (Elastic Compute Cloud) | Resizable compute capacity in the cloud |
| EC2 Spot | Take advantage of unused EC2 capacity |
| ECS (Elastic Container Service) | Container management service that supports Docker containers |
| EFS (Elastic File System) | Shared file storage |
| EKS | Elastic Container Service for Kubernetes |
| Elastic Transcoder | Media and video transcoding in the cloud |
| ElastiCache | In-memory cache in the cloud |
| Elastic Beanstalk | Service for deploying and scaling web applications and services |
| ELB (Elastic Load Balancing) | Distributes incoming application traffic across multiple Amazon EC2 instances |
| EMR (Elastic MapReduce) | Data processing using Hadoop |
| ES (Elasticsearch) | Deploy, operate, and scale Elasticsearch clusters |
| Firehose | Capture and load streaming data |
| GameLift | Dedicated game server hosting |
| Glue | Extract, transform, and load data for analytics |
| GuardDuty | Intelligent threat detection |
| Health | Visibility into the state of your AWS resources, services, and accounts |
| Inspector | Automated security assessment |
| IoT (Internet of Things) | Connect IoT devices with cloud services |
| Kinesis | Service for real-time processing of large, distributed data streams |
| KMS (Key Management Service) | Create and control encryption keys |
| Lambda | Serverless computing |
| Lex | Build conversation bots |
| Machine Learning | Create machine learning models |
| MediaConnect | Transport for live video |
| MediaConvert | Video processing for broadcast and multiscreen delivery |
| MediaPackage | Prepare and protect video for delivery over the internet |
| MediaTailor | Scalable server-side ad insertion |
| MQ | Managed message broker for ActiveMQ |
| Managed Streaming for Kafka | Build and run applications that use Apache Kafka to process streaming data |
| NAT Gateway | Enable instances in a private subnet to connect to the internet or other AWS services |
| Neptune | Fast, reliable graph database built for the cloud |
| OpsWorks | Configuration management |
| Polly | Text-to-speech service |
| RDS (Relational Database Service) | Relational database in the cloud |
| Redshift | Data warehouse solution |
| Rekognition | Image and video analysis for applications |
| Route 53 | DNS and traffic management with availability monitoring |
| S3 (Simple Storage Service) | Highly available and scalable cloud storage service |
| SageMaker | Machine learning models and algorithms |
| SES (Simple Email Service) | Cost-effective, outbound-only email-sending service |
| SNS (Simple Notification Service) | Alerts and notifications |
| SQS (Simple Queue Service) | Messaging queue service |
| Storage Gateway | Hybrid cloud storage |
| SWF (Simple Workflow Service) | Cloud workflow management |
| VPC (Virtual Private Cloud) | Launch AWS resources into a virtual network |
| Web Application Firewall (WAF) | Protect web applications from common web exploits |
| WorkSpaces | Secure desktop computing service |
| X-Ray | Tracing for distributed applications |

Setup

Setting up the Datadog integration with Amazon Web Services requires configuring role delegation using AWS IAM. To get a better understanding of role delegation, refer to the AWS IAM Best Practices guide.

Role delegation

Choose a method for setting up the necessary AWS role. CloudFormation is recommended.

  1. Open the Datadog AWS integration tile.
  2. Under the Configuration tab, choose Automatically Using CloudFormation. If you already have an attached AWS account, click Add another account first.
  3. Log in to the AWS console.
  4. On the CloudFormation page, create a new stack and provide your Datadog API key.
  5. Update the Datadog AWS integration tile with the IAM role name and account ID used to create the CloudFormation stack.

AWS

  1. Create a new role in the AWS IAM Console.
  2. Select Another AWS account for the Role Type.
  3. For Account ID, enter 464622532012 (Datadog’s account ID). This means that you are granting Datadog read-only access to your AWS data.
  4. Select Require external ID and enter the one generated in the AWS integration tile. Make sure you leave Require MFA disabled. For more information about the External ID, refer to this document in the IAM User Guide.
  5. Click Next: Permissions.
  6. If you’ve already created the policy, search for it on this page and select it, then skip to step 12. Otherwise, click Create Policy, which opens in a new window.
  7. Select the JSON tab. To take advantage of every AWS integration offered by Datadog, paste the policy snippet from the Datadog AWS IAM Policy section into the textbox. As other components are added to an integration, these permissions may change.
  8. Click Review policy.
  9. Name the policy DatadogAWSIntegrationPolicy or one of your own choosing, and provide an apt description.
  10. Click Create policy. You can now close this window.
  11. Back in the “Create role” window, refresh the list of policies and select the policy you just created.
  12. Click Next: Review.
  13. Give the role a name such as DatadogAWSIntegrationRole, as well as an apt description. Click Create Role.
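
The console steps above result in a role whose trust policy allows Datadog's account to assume it only when the external ID is supplied. The snippet below is a minimal sketch of what that trust policy document typically looks like, assuming the standard IAM layout for cross-account access with an external ID. The function name `build_trust_policy` and the placeholder external ID are illustrative and not part of any Datadog or AWS tooling.

```python
import json

# Datadog's account ID, as given in step 3 above.
DATADOG_ACCOUNT_ID = "464622532012"


def build_trust_policy(external_id: str) -> dict:
    """Return a trust policy letting Datadog's account assume the role.

    Assumption: this mirrors the usual IAM trust-policy shape for
    "Another AWS account" with "Require external ID" selected.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{DATADOG_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder value; use the external ID generated in the integration tile.
    print(json.dumps(build_trust_policy("example-external-id"), indent=4))
```

Comparing this output with the Trust relationships tab of the created role is one way to confirm that the account ID and external ID were entered as intended.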

Bonus: If you use Terraform, set up your Datadog IAM policy by following the guide The AWS Integration with Terraform.

Datadog

  1. Open the AWS integration tile.
  2. Select the Role Delegation tab and select Manually.
  3. Enter your AWS Account ID without dashes, for example: 123456789012. Your Account ID can be found in the ARN of the role created during the installation of the AWS integration.
  4. Enter the name of the created role. Note: The role name you enter in the integration tile is case sensitive and must exactly match the role name created on the AWS side.
  5. Choose the services to collect metrics from on the left side of the dialog.
  6. Optionally, add tags to all hosts and metrics.
  7. Optionally, monitor a subset of EC2 instances by entering the AWS tags in the “to hosts with tag” textbox. Note: This also applies to an instance’s attached EBS volumes.
  8. Optionally, monitor a subset of Lambdas by entering the AWS tags in the “to Lambdas with tag” textbox.
  9. Click Install Integration.

Datadog AWS IAM Policy

The permissions listed below are included in the Policy Document using wildcards such as List* and Get*. If you require strict policies, replace each wildcard with the complete action names as listed, and refer to the Amazon API documentation for the services you require.
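
When tightening a policy, it can help to check which complete action names a wildcard entry would cover. The sketch below is a rough local approximation, not an IAM evaluator: it looks only at `Allow` statements and their `Action` patterns and ignores `Deny`, `Resource`, and `Condition`. It relies on Python's `fnmatch`, which also treats `[` and `]` as special characters, unlike IAM; that difference does not matter for the action names used here. The helper name `action_allowed` is hypothetical.

```python
from fnmatch import fnmatchcase


def action_allowed(policy: dict, action: str) -> bool:
    """Return True if any Allow statement's Action pattern matches `action`.

    IAM action names are matched case-insensitively, so both sides are
    lowercased before the wildcard comparison.
    """
    wanted = action.lower()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be in a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        patterns = statement.get("Action", [])
        if isinstance(patterns, str):  # a single action may be a bare string
            patterns = [patterns]
        if any(fnmatchcase(wanted, pattern.lower()) for pattern in patterns):
            return True
    return False


# A small excerpt in the same shape as the policy documents on this page.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["cloudwatch:Get*", "cloudwatch:List*", "ec2:Describe*"],
            "Effect": "Allow",
            "Resource": "*",
        }
    ],
}

for name in ("cloudwatch:GetMetricData", "cloudwatch:ListMetrics", "sns:Publish"):
    print(name, action_allowed(policy, name))
```

With this excerpt, the two CloudWatch actions are covered by the wildcards and `sns:Publish` is not, so a strict policy would need to list it explicitly if the corresponding feature is required.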

All Permissions

If you are not comfortable granting all permissions, at the very least use the existing policies named AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For more detailed information regarding permissions, see the Core permissions section.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "apigateway:GET",
                "autoscaling:Describe*",
                "budgets:ViewBudget",
                "cloudfront:GetDistributionConfig",
                "cloudfront:ListDistributions",
                "cloudtrail:DescribeTrails",
                "cloudtrail:GetTrailStatus",
                "cloudwatch:Describe*",
                "cloudwatch:Get*",
                "cloudwatch:List*",
                "codedeploy:List*",
                "codedeploy:BatchGet*",
                "directconnect:Describe*",
                "dynamodb:List*",
                "dynamodb:Describe*",
                "ec2:Describe*",
                "ecs:Describe*",
                "ecs:List*",
                "elasticache:Describe*",
                "elasticache:List*",
                "elasticfilesystem:DescribeFileSystems",
                "elasticfilesystem:DescribeTags",
                "elasticfilesystem:DescribeAccessPoints",
                "elasticloadbalancing:Describe*",
                "elasticmapreduce:List*",
                "elasticmapreduce:Describe*",
                "es:ListTags",
                "es:ListDomainNames",
                "es:DescribeElasticsearchDomains",
                "health:DescribeEvents",
                "health:DescribeEventDetails",
                "health:DescribeAffectedEntities",
                "kinesis:List*",
                "kinesis:Describe*",
                "lambda:AddPermission",
                "lambda:GetPolicy",
                "lambda:List*",
                "lambda:RemovePermission",
                "logs:DeleteSubscriptionFilter",
                "logs:DescribeLogGroups",
                "logs:DescribeLogStreams",
                "logs:DescribeSubscriptionFilters",
                "logs:FilterLogEvents",
                "logs:PutSubscriptionFilter",
                "logs:TestMetricFilter",
                "rds:Describe*",
                "rds:List*",
                "redshift:DescribeClusters",
                "redshift:DescribeLoggingStatus",
                "route53:List*",
                "s3:GetBucketLogging",
                "s3:GetBucketLocation",
                "s3:GetBucketNotification",
                "s3:GetBucketTagging",
                "s3:ListAllMyBuckets",
                "s3:PutBucketNotification",
                "ses:Get*",
                "sns:List*",
                "sns:Publish",
                "sqs:ListQueues",
                "states:ListStateMachines",
                "states:DescribeStateMachine",
                "support:*",
                "tag:GetResources",
                "tag:GetTagKeys",
                "tag:GetTagValues",
                "xray:BatchGetTraces",
                "xray:GetTraceSummaries"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

Core permissions

The core Datadog AWS integration pulls data from AWS CloudWatch. At a minimum, your Policy Document needs to allow the following actions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "cloudwatch:Get*",
                "cloudwatch:List*",
                "ec2:Describe*",
                "support:*",
                "tag:GetResources",
                "tag:GetTagKeys",
                "tag:GetTagValues"
            ],
            "Effect": "Allow",
            "Resource": "*"
        }
    ]
}

| AWS Permission | Description |
|---|---|
| cloudwatch:ListMetrics | List the available CloudWatch metrics. |
| cloudwatch:GetMetricData | Fetch data points for a given metric. |
| support:* | Add metrics about service limits. It requires full access because of AWS limitations. |
| tag:GetResources | Get custom tags by resource type. |
| tag:GetTagKeys | Get tag keys by region within an AWS account. |
| tag:GetTagValues | Get tag values by region within an AWS account. |

The main use of the Resource Group Tagging API is to reduce the number of API calls needed to collect custom tags. For more information, review the Tag policies documentation on the AWS website.

GovCloud and China

  1. Open the AWS integration tile.
  2. Select the Access Keys (GovCloud or China Only) tab.
  3. Enter your AWS Access Key and AWS Secret Key. Only access and secret keys for GovCloud and China are accepted.
  4. Choose the services to collect metrics from on the left side of the dialog.
  5. Optionally, add tags to all hosts and metrics.
  6. Optionally, monitor a subset of EC2 instances by entering the AWS tags in the “to hosts with tag” textbox. Note: This also applies to an instance’s attached EBS volumes.
  7. Optionally, monitor a subset of Lambdas by entering the AWS tags in the “to Lambdas with tag” textbox.
  8. Click Install Integration.

Log collection

There are two ways of sending AWS service logs to Datadog:

  • Kinesis Firehose destination: Use the Datadog destination in your Kinesis Firehose delivery stream to forward logs to Datadog.
  • CloudFormation: Deploy the Datadog Lambda function, which subscribes to S3 buckets or CloudWatch log groups and forwards logs to Datadog.

We strongly recommend using the Kinesis Firehose destination when you have to send your logs to multiple destinations. CloudWatch log groups can have only one subscriber, but Kinesis streams can have multiple subscribers. By subscribing a Kinesis stream to the log groups, you can give multiple consumers access to your log data by subscribing them all to that stream.

Data Collected

Metrics

aws.logs.incoming_bytes
(gauge)
The volume of log events in uncompressed bytes uploaded to CloudWatch Logs.
Shown as byte
aws.logs.incoming_log_events
(count)
The number of log events uploaded to CloudWatch Logs.
Shown as event
aws.logs.forwarded_bytes
(gauge)
The volume of log events in compressed bytes forwarded to the subscription destination.
Shown as byte
aws.logs.forwarded_log_events
(count)
The number of log events forwarded to the subscription destination.
Shown as event
aws.logs.delivery_errors
(count)
The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination.
Shown as event
aws.logs.delivery_throttling
(count)
The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination.
Shown as event
aws.events.invocations
(count)
Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations but does not include throttled or retried attempts until they fail permanently.
aws.events.failed_invocations
(count)
Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt.
aws.events.triggered_rules
(count)
Measures the number of triggered rules that matched with any event.
aws.events.matched_events
(count)
Measures the number of events that matched with any rule.
aws.events.throttled_rules
(count)
Measures the number of triggered rules that are being throttled.
aws.usage.call_count
(count)
The number of specified operations performed in your account.
Shown as operation
aws.usage.resource_count
(count)
The number of specified resources in your account.
Shown as resource

Events

Events from AWS are collected on a per-AWS-service basis. Refer to the documentation of specific AWS services to learn more about the events collected.

Tags

The following tags are collected from AWS integrations. Note: Some tags only display on specific metrics.

| Integration | Datadog Tag Keys |
|---|---|
| All | region |
| API Gateway | apiid, apiname, method, resource, stage |
| Auto Scaling | autoscalinggroupname, autoscaling_group |
| Billing | account_id, budget_name, budget_type, currency, servicename, time_unit |
| CloudFront | distributionid |
| CodeBuild | project_name |
| CodeDeploy | application, creator, deployment_config, deployment_group, deployment_option, deployment_type, status |
| Direct Connect | connectionid |
| DynamoDB | globalsecondaryindexname, operation, streamlabel, tablename |
| EBS | volumeid, volume-name, volume-type |
| EC2 | autoscaling_group, availability-zone, image, instance-id, instance-type, kernel, name, security_group_name |
| ECS | clustername, servicename, instance_id |
| EFS | filesystemid |
| ElastiCache | cachenodeid, cache_node_type, cacheclusterid, cluster_name, engine, engine_version, prefered_availability-zone, replication_group |
| Elastic Beanstalk | environmentname, enviromentid |
| ELB | availability-zone, hostname, loadbalancername, name, targetgroup |
| EMR | cluster_name, jobflowid |
| ES | dedicated_master_enabled, ebs_enabled, elasticsearch_version, instance_type, zone_awareness_enabled |
| Firehose | deliverystreamname |
| Health | event_category, status, service |
| IoT | actiontype, protocol, rulename |
| Kinesis | streamname, name, state |
| KMS | keyid |
| Lambda | functionname, resource, executedversion, memorysize, runtime |
| Machine Learning | mlmodelid, requestmode |
| MQ | broker, queue, topic |
| OpsWorks | stackid, layerid, instanceid |
| Polly | operation |
| RDS | auto_minor_version_upgrade, dbinstanceclass, dbclusteridentifier, dbinstanceidentifier, dbname, engine, engineversion, hostname, name, publicly_accessible, secondary_availability-zone |
| Redshift | clusteridentifier, latency, nodeid, service_class, stage, wlmid |
| Route 53 | healthcheckid |
| S3 | bucketname, filterid, storagetype |
| SES | Tag keys are custom set in AWS. |
| SNS | topicname |
| SQS | queuename |
| VPC | nategatewayid, vpnid, tunnelipaddress |
| WorkSpaces | directoryid, workspaceid |

Troubleshooting

Discrepancy between your data in CloudWatch and Datadog

There are two important distinctions to be aware of:

  1. In AWS, for counters, a graph that is set to ‘sum’ and ‘1 minute’ shows the total number of occurrences in the minute leading up to that point, that is, the rate per minute. Datadog displays the raw data from AWS normalized to per-second values, regardless of the time frame selected in AWS. This is why the value in Datadog may appear lower.
  2. Overall, min/max/avg have a different meaning within AWS than in Datadog. In AWS, average latency, minimum latency, and maximum latency are three distinct metrics that AWS collects. When Datadog pulls metrics from AWS CloudWatch, the average latency is received as a single time series per ELB. Within Datadog, when you are selecting ‘min’, ‘max’, or ‘avg’, you are controlling how multiple time series are combined. For example, requesting system.cpu.idle without any filter would return one series for each host that reports that metric and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.
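
Both distinctions come down to simple arithmetic, which the sketch below illustrates. It is only a toy model with made-up values: `per_second` and `combine` are hypothetical helper names, and the host names and numbers are invented for the example rather than taken from either product.

```python
from statistics import mean


def per_second(sum_value: float, period_seconds: int) -> float:
    """Convert a 'sum' over a period into a per-second rate."""
    return sum_value / period_seconds


def combine(series: dict[str, list[float]], how: str) -> list[float]:
    """Combine several time series point by point using min, max, or avg."""
    aggregators = {"min": min, "max": max, "avg": mean}
    aggregate = aggregators[how]
    # zip(*...) groups the values that share a timestamp across all series.
    return [aggregate(points) for points in zip(*series.values())]


# Distinction 1: a counter shown as 120 for a 1-minute 'sum'
# corresponds to 2 occurrences per second after normalization.
print(per_second(120, 60))

# Distinction 2: 'avg' and 'max' differ only when several series are combined.
idle = {"host-a": [90.0, 80.0], "host-b": [70.0, 60.0]}  # hypothetical values
print(combine(idle, "avg"))
print(combine(idle, "max"))

# With a single series there is nothing to aggregate, so both agree.
one_host = {"host-a": [90.0, 80.0]}
print(combine(one_host, "avg") == combine(one_host, "max"))
```

The last line shows why switching between average and max on a single-host query changes nothing: each timestamp has only one value to combine.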

Metrics delayed

When using the AWS integration, Datadog pulls in your metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.

To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. AWS makes metrics available depending on the account level. For example, if you are paying for “detailed metrics” within AWS, they are available more quickly. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.

Datadog has the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact Datadog support for more info.

To obtain metrics with virtually zero delay, install the Datadog Agent on the host. For more information, see Datadog’s blog post Don’t fear the Agent: Agent-based monitoring.

Missing metrics

CloudWatch’s API returns only metrics with data points. For instance, if an ELB has no attached instances, you should expect to see no metrics for that ELB in Datadog.

Wrong count of aws.elb.healthy_host_count

When the cross-zone load balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch’s side), so if you have 2 instances in 1a and 3 in 1b, the metric displays 5 instances per availability zone. As this can be counterintuitive, we’ve added new metrics, aws.elb.healthy_host_count_deduped and aws.elb.un_healthy_host_count_deduped, that display the count of healthy and unhealthy instances per availability zone, regardless of whether the cross-zone load balancing option is enabled.
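
The difference between the two metrics can be modeled in a few lines. This is a simplified sketch of the behavior described above, not how either CloudWatch or Datadog computes the values; the function names and the zone labels `1a` and `1b` are illustrative.

```python
def reported_healthy_hosts(per_zone: dict[str, int], cross_zone: bool) -> dict[str, int]:
    """Model aws.elb.healthy_host_count per availability zone.

    With cross-zone load balancing enabled, every zone reports the total
    number of instances attached to the ELB.
    """
    if cross_zone:
        total = sum(per_zone.values())
        return {zone: total for zone in per_zone}
    return dict(per_zone)


def deduped_healthy_hosts(per_zone: dict[str, int]) -> dict[str, int]:
    """Model aws.elb.healthy_host_count_deduped: the actual per-zone count."""
    return dict(per_zone)


# 2 healthy instances in zone 1a and 3 in zone 1b.
instances = {"1a": 2, "1b": 3}

print(reported_healthy_hosts(instances, cross_zone=True))
print(reported_healthy_hosts(instances, cross_zone=False))
print(deduped_healthy_hosts(instances))
```

In this model, summing the non-deduped metric across zones with cross-zone balancing enabled gives 10 rather than 5, which is the kind of overcount the deduped metrics are meant to avoid.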

Duplicated hosts when installing the Agent

When installing the Agent on an AWS host, you might see duplicated hosts on the infra page for a few hours if you manually set the hostname in the Agent’s configuration. This second host disappears a few hours later, and won’t affect your billing.