
AWS


Overview

Connect to Amazon Web Services (AWS) in order to:

  • See automatic AWS status updates in your stream
  • Get CloudWatch metrics for EC2 hosts without installing the Agent
  • Tag your EC2 hosts with EC2-specific information (e.g. availability zone)
  • See EC2 scheduled maintenance events in your stream
  • Collect CloudWatch metrics and events from many other AWS products

Related integrations include:

  • API Gateway: create, publish, maintain, and secure APIs
  • Autoscaling: scale EC2 capacity
  • Billing: billing and budgets
  • CloudFront: global content delivery network
  • CloudTrail: access to log files and AWS API calls
  • CloudSearch: managed search service
  • Direct Connect: dedicated network connection to AWS
  • DynamoDB: NoSQL database
  • EC2 Container Service (ECS): container management service that supports Docker containers
  • Elastic Beanstalk: easy-to-use service for deploying and scaling web applications and services
  • Elastic Block Store (EBS): persistent block-level storage volumes
  • ElastiCache: in-memory cache in the cloud
  • Elastic Compute Cloud (EC2): resizable compute capacity in the cloud
  • Elastic File System (EFS): shared file storage
  • Elastic Load Balancing (ELB): distributes incoming application traffic across multiple Amazon EC2 instances
  • Elastic MapReduce (EMR): data processing using Hadoop
  • Elasticsearch Service (ES): deploy, operate, and scale Elasticsearch clusters
  • Firehose: capture and load streaming data
  • IoT: connect IoT devices with cloud services
  • Kinesis: service for real-time processing of large, distributed data streams
  • Key Management Service (KMS): create and control encryption keys
  • Lambda: serverless computing
  • Machine Learning (ML): create machine learning models
  • OpsWorks: configuration management
  • Polly: text-to-speech service
  • Redshift: data warehouse solution
  • Relational Database Service (RDS): relational database in the cloud
  • Route 53: DNS and traffic management with availability monitoring
  • Simple Email Service (SES): cost-effective, outbound-only email-sending service
  • Simple Notification Service (SNS): alerts and notifications
  • Simple Queue Service (SQS): messaging queue service
  • Simple Storage Service (S3): highly available and scalable cloud storage service
  • Simple Workflow Service (SWF): cloud workflow management
  • Storage Gateway: hybrid cloud storage
  • Web Application Firewall (WAF): protect web applications from common web exploits
  • Workspaces: secure desktop computing service

Setup

Installation

Setting up the Datadog integration with Amazon Web Services requires configuring role delegation using AWS IAM. To get a better understanding of role delegation, refer to the AWS IAM Best Practices guide.

The GovCloud and China regions do not currently support IAM role delegation. If you are deploying in these regions, please skip to the configuration section below.
  1. Create a new role in the AWS IAM Console.
  2. Select Another AWS account for the Role Type.
  3. For Account ID, enter 464622532012 (Datadog’s account ID). This means that you grant Datadog read-only access to your AWS data.
  4. Check off Require external ID and enter the one generated in the Datadog app. Make sure you leave Require MFA disabled. For more information about the External ID, refer to this document in the IAM User Guide.
  5. Click Next: Permissions.
  6. If you’ve already created the policy, search for it on this page and select it, then skip to step 12. Otherwise, click Create Policy, which will open in a new window.
  7. Select the JSON tab. To take advantage of every AWS integration offered by Datadog, paste the policy snippet below into the textbox. As we add other components to the integration, these permissions may change.
  8. Click Review policy.
  9. Name the policy DatadogAWSIntegrationPolicy or one of your own choosing, and provide an apt description.
  10. Click Create policy. You can now close this window.
  11. Back in the “Create role” window, refresh the list of policies and select the policy you just created.
  12. Click Next: Review.
  13. Give the role a name such as DatadogAWSIntegrationRole, as well as an apt description. Click Create Role.
These actions and the ones listed below are included in the Policy Document using wildcards such as List* and Get*. If you require strict policies, use the complete action names as listed and reference the Amazon API documentation for the services you require.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "autoscaling:Describe*",
        "budgets:ViewBudget",
        "cloudfront:GetDistributionConfig",
        "cloudfront:ListDistributions",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "codedeploy:List*",
        "codedeploy:BatchGet*",
        "directconnect:Describe*",
        "dynamodb:List*",
        "dynamodb:Describe*",
        "ec2:Describe*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeTags",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "elasticmapreduce:Describe*",
        "es:ListTags",
        "es:ListDomainNames",
        "es:DescribeElasticsearchDomains",
        "health:DescribeEvents",
        "health:DescribeEventDetails",
        "health:DescribeAffectedEntities",
        "kinesis:List*",
        "kinesis:Describe*",
        "lambda:AddPermission",
        "lambda:GetPolicy",
        "lambda:List*",
        "lambda:RemovePermission",
        "logs:Get*",
        "logs:Describe*",
        "logs:FilterLogEvents",
        "logs:TestMetricFilter",
        "logs:PutSubscriptionFilter",
        "logs:DeleteSubscriptionFilter",
        "logs:DescribeSubscriptionFilters",
        "rds:Describe*",
        "rds:List*",
        "redshift:DescribeClusters",
        "redshift:DescribeLoggingStatus",
        "route53:List*",
        "s3:GetBucketLogging",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:GetBucketTagging",
        "s3:ListAllMyBuckets",
        "s3:PutBucketNotification",
        "ses:Get*",
        "sns:List*",
        "sns:Publish",
        "sqs:ListQueues",
        "support:DescribeTrustedAdvisorChecks", 
        "support:RefreshTrustedAdvisorCheck", 
        "support:DescribeTrustedAdvisorCheckResult",
        "tag:GetResources",
        "tag:GetTagKeys",
        "tag:GetTagValues"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
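
If you script your AWS setup, the same role and policy can be created programmatically. The sketch below uses boto3 and mirrors steps 1-13 above; the external ID value and the datadog_policy.json file name are illustrative placeholders, not part of the official setup.

import json
import boto3

iam = boto3.client("iam")

# Placeholders -- substitute the external ID generated in the Datadog app
# and a local copy of the policy document shown above.
EXTERNAL_ID = "generated-in-the-datadog-app"
POLICY_JSON = open("datadog_policy.json").read()

# Trust policy: allow Datadog's account (464622532012) to assume the role,
# but only when it presents the agreed external ID.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::464622532012:root"},
        "Action": "sts:AssumeRole",
        "Condition": {"StringEquals": {"sts:ExternalId": EXTERNAL_ID}},
    }],
}

iam.create_role(
    RoleName="DatadogAWSIntegrationRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Grants Datadog read-only access for the AWS integration",
)

policy = iam.create_policy(
    PolicyName="DatadogAWSIntegrationPolicy",
    PolicyDocument=POLICY_JSON,
)

iam.attach_role_policy(
    RoleName="DatadogAWSIntegrationRole",
    PolicyArn=policy["Policy"]["Arn"],
)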

If you are not comfortable granting all of these permissions, at the very least use the existing policies named AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For more detailed information regarding permissions, see the Permissions section below.

Permissions

The core Datadog-AWS integration pulls data from AWS CloudWatch. At a minimum, your Policy Document needs to allow the following actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "ec2:Describe*",
        "support:*",
        "tag:GetResources",
        "tag:GetTagKeys",
        "tag:GetTagValues"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
  • CloudWatch:

    • cloudwatch:ListMetrics to list the available CloudWatch metrics.
    • cloudwatch:GetMetricData to fetch data points for a given metric.
  • Support:

    • support:*: Used to add metrics about service limits. It requires full access due to AWS limitations.
  • Tag:

    • tag:GetResources: Used to get custom tags by resource type.
    • tag:GetTagKeys: Used to get tag keys by region within an AWS account.
    • tag:GetTagValues: Used to get tag values by region within an AWS account.

    The main use of the Resource Group Tagging API is to reduce the number of API calls needed to collect custom tags. For more information on tag policies, review the documentation on the AWS website.
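
To see what this minimal policy enables in practice, here is a hedged boto3 sketch of the kind of metric-by-metric crawl the integration performs; the AWS/EC2 namespace and CPUUtilization metric are illustrative choices, not a description of Datadog's internal crawler.

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

# cloudwatch:ListMetrics -- discover the available metrics.
metrics = cloudwatch.list_metrics(Namespace="AWS/EC2", MetricName="CPUUtilization")

# cloudwatch:GetMetricData -- fetch datapoints for one discovered metric.
# (Assumes at least one metric was returned above.)
queries = [{
    "Id": "cpu",
    "MetricStat": {
        "Metric": metrics["Metrics"][0],
        "Period": 300,
        "Stat": "Average",
    },
}]
data = cloudwatch.get_metric_data(
    MetricDataQueries=queries,
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
)
for result in data["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"])))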

Configuration

  1. Open the AWS integration tile.
  2. Select the Role Delegation tab.
  3. Enter your AWS Account ID without dashes, e.g. 123456789012, not 1234-5678-9012. Your Account ID can be found in the ARN of the role created during the installation of the AWS integration (see also the sketch after these steps). Then enter the name of the created role. Note: The role name you enter in the Integration Tile is case sensitive and must exactly match the role name created on the AWS side.
  4. Choose the services you want to collect metrics for on the left side of the dialog. You can optionally add tags to all hosts and metrics. If you want to monitor only a subset of EC2 instances on AWS, tag them and specify the tag in the limit textbox here.
  5. Click Install Integration.
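
If you are unsure of your account ID, it can also be retrieved programmatically; a small sketch using boto3's STS client (one of several ways to obtain it):

import boto3

# The account ID is the 12-digit number embedded in the role ARN,
# e.g. arn:aws:iam::123456789012:role/DatadogAWSIntegrationRole.
account_id = boto3.client("sts").get_caller_identity()["Account"]
print(account_id)  # e.g. 123456789012 -- enter this without dashes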

Configuration for China and GovCloud

  1. Open the AWS integration tile.
  2. Select the Access Keys (GovCloud or China Only) tab.
  3. Enter your AWS Access Key and AWS Secret Key. Only access and secret keys for China and GovCloud are accepted.
  4. Choose the services you want to collect metrics for on the left side of the dialog. You can optionally add tags to all hosts and metrics. If you want to monitor only a subset of EC2 instances on AWS, tag them and specify the tag in the limit textbox here.
  5. Click Install Integration.

Log collection

To start collecting logs from one of your AWS services, here is the general process:

  1. Set up the Datadog Lambda function.
  2. Enable logging for your AWS service (most AWS services can log to an S3 bucket or CloudWatch Log Group).
  3. Configure the triggers that cause the Lambda to execute and send logs to Datadog. There are two ways to configure the triggers: have Datadog manage them automatically (see Collecting logs from S3 below) or set them up yourself (see Manually set up triggers below).

Create a new Lambda function

Use the AWS serverless repository to automatically deploy the Lambda in your AWS account, or create it manually with the following instructions:

  1. Navigate to the Lambda Console and create a new function.

  2. Select Author from scratch and give the function a unique name.

  3. Change the Runtime to Python 2.7.

  4. For Role, select Create new role from template(s) and give the role a unique name.

  5. If you are pulling logs from an S3 bucket, under Policy templates search for and select s3 object read-only permissions.

  6. Select Create Function.


Provide the code and configure the Lambda

  1. Copy and paste the code from this repo into the function code area.
  2. Ensure the Handler reads lambda_function.lambda_handler.
  3. At the top of the script you’ll find a section called #Parameters. You have two options for providing the API Key that the Lambda function requires:

    • Set up an environment variable (preferred)
    • Edit the code directly with your Datadog API Key
  4. Scroll down beyond the inline code area to Basic Settings.

  5. Set the memory to around 1GB.

  6. Set the timeout limit. We recommend 120 seconds.


  7. Scroll back to the top of the page and hit Save.
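
For orientation, here is a heavily simplified sketch of what a log-forwarding Lambda of this kind does. It is not the Datadog function itself (use the code from the repo above); the DD_API_KEY environment variable name and the TCP intake endpoint shown are illustrative assumptions.

import base64
import gzip
import io
import json
import os
import socket
import ssl

# Illustrative only -- the real function lives in the Datadog repo above.
DD_API_KEY = os.environ.get("DD_API_KEY", "<your_api_key>")

def lambda_handler(event, context):
    # CloudWatch Logs triggers deliver a base64-encoded, gzipped payload.
    payload = base64.b64decode(event["awslogs"]["data"])
    data = json.loads(gzip.GzipFile(fileobj=io.BytesIO(payload)).read())

    for log_event in data["logEvents"]:
        send_to_datadog(log_event["message"])

def send_to_datadog(message):
    # Hypothetical TCP intake shown for illustration; check Datadog's
    # documented log intake endpoints for the real destination.
    sock = ssl.wrap_socket(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    sock.connect(("intake.logs.datadoghq.com", 10516))
    sock.send("{} {}\n".format(DD_API_KEY, message).encode("utf-8"))
    sock.close()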

Test your Lambda

  1. Press Test.
  2. Search for and select CloudWatch Logs as the sample event.
  3. Give the event a unique name and press Create.
  4. Press Test and ensure the test passes with no errors.
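
If you prefer to sanity-check the handler outside the console, a synthetic CloudWatch Logs event can be built the same way AWS does (gzip, then base64). A hedged sketch, assuming you keep a local copy of the function code as lambda_function.py:

import base64
import gzip
import io
import json

# Build a minimal CloudWatch Logs test event by hand.
record = {"logEvents": [{"message": "hello from a test"}]}
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="w") as gz:
    gz.write(json.dumps(record).encode("utf-8"))
event = {"awslogs": {"data": base64.b64encode(buf.getvalue()).decode("utf-8")}}

# Invoke the handler directly (assumes lambda_function.py is importable).
from lambda_function import lambda_handler
lambda_handler(event, None)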

Enable logging for your AWS service

Any AWS service that generates logs in an S3 bucket or a CloudWatch Log Group is supported. Find specific setup instructions for the most commonly used services in the table below:

AWS service | Activate AWS service logging | Send AWS logs to Datadog
API Gateway | Enable AWS API Gateway logs | Manual log collection
CloudFront | Enable AWS CloudFront logs | Manual and automatic log collection
CloudTrail | Enable AWS CloudTrail logs | Manual log collection
DynamoDB | Enable AWS DynamoDB logs | Manual log collection
EC2 | - | Use the Datadog Agent or another log shipper to send your logs to Datadog
ECS | - | Use the Docker Agent to gather your logs
Elastic Load Balancing (ELB) | Enable AWS ELB logs | Manual and automatic log collection
Lambda | - | Manual and automatic log collection
RDS | Enable AWS RDS logs | Manual log collection
Route 53 | Enable AWS Route 53 logs | Manual log collection
S3 | Enable AWS S3 logs | Manual and automatic log collection
SNS | SNS emits no logs itself; process the logs and events that transit through SNS | Manual log collection
Redshift | Enable AWS Redshift logs | Manual and automatic log collection
VPC | Enable AWS VPC logs | Manual log collection

Send AWS service logs to Datadog

Collecting logs from CloudWatch

If you are storing logs in a CloudWatch Log Group, send them to Datadog with the following steps:

  1. If you haven’t already, set up the Datadog log collection AWS Lambda function.
  2. Once the Lambda function is installed, manually add a trigger on the CloudWatch Log Group that contains your logs in the AWS console: select the corresponding CloudWatch Log Group, add a filter name (you can leave the filter pattern empty), and add the trigger.

Once done, go into your Datadog Log section to start exploring your logs!
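
Under the hood, a CloudWatch Logs trigger is a subscription filter on the log group plus an invoke permission on the function. A hedged boto3 sketch of the equivalent setup; FUNCTION_ARN and LOG_GROUP are placeholders:

import boto3

# Placeholders -- substitute your own values.
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:datadog-forwarder"
LOG_GROUP = "/aws/apigateway/my-api"

# Allow CloudWatch Logs to invoke the forwarder function.
boto3.client("lambda").add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="datadog-cloudwatch-logs",
    Action="lambda:InvokeFunction",
    Principal="logs.amazonaws.com",
)

# Subscribe the log group to the function; an empty filter pattern
# forwards every log event.
boto3.client("logs").put_subscription_filter(
    logGroupName=LOG_GROUP,
    filterName="datadog",
    filterPattern="",
    destinationArn=FUNCTION_ARN,
)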

Collecting logs from S3

If you are storing logs in many S3 buckets, Datadog can automatically manage triggers for you.

  1. If you haven’t already, set up the Datadog log collection AWS Lambda function.
  2. Add the required permissions to your Datadog role in the IAM Console. You may already have some of these permissions from other AWS integrations. How each permission is used is described in the list below, and a sketch of the equivalent API calls follows these steps:

    "cloudfront:GetDistributionConfig",
    "cloudfront:ListDistributions",
    "elasticloadbalancing:DescribeLoadBalancers",
    "elasticloadbalancing:DescribeLoadBalancerAttributes",
    "lambda:AddPermission",
    "lambda:GetPolicy",
    "lambda:RemovePermission",
    "redshift:DescribeClusters",
    "redshift:DescribeLoggingStatus",
    "s3:GetBucketLogging",
    "s3:GetBucketLocation",
    "s3:GetBucketNotification",
    "s3:ListAllMyBuckets",
    "s3:PutBucketNotification",
    "logs:PutSubscriptionFilter",
    "logs:DeleteSubscriptionFilter",
    "logs:DescribeSubscriptionFilters"
    
    • cloudfront:GetDistributionConfig: Get the name of the S3 bucket containing CloudFront access logs.
    • cloudfront:ListDistributions: List all CloudFront distributions.
    • elasticloadbalancing:DescribeLoadBalancers: List all load balancers.
    • elasticloadbalancing:DescribeLoadBalancerAttributes: Get the name of the S3 bucket containing ELB access logs.
    • lambda:AddPermission: Add permission allowing a particular S3 bucket to trigger a Lambda function.
    • lambda:GetPolicy: Get the Lambda policy when triggers are to be removed.
    • lambda:RemovePermission: Remove permissions from a Lambda policy.
    • redshift:DescribeClusters: List all Redshift clusters.
    • redshift:DescribeLoggingStatus: Get the name of the S3 bucket containing Redshift Logs.
    • s3:GetBucketLogging: Get the name of the S3 bucket containing S3 access logs.
    • s3:GetBucketLocation: Get the region of the S3 bucket containing S3 access logs.
    • s3:GetBucketNotification: Get existing Lambda trigger configurations.
    • s3:ListAllMyBuckets: List all S3 buckets.
    • s3:PutBucketNotification: Add or remove a Lambda trigger based on S3 bucket events.
    • logs:PutSubscriptionFilter: Add a Lambda trigger based on CloudWatch Log events.
    • logs:DeleteSubscriptionFilter: Remove a Lambda trigger based on CloudWatch Log events.
    • logs:DescribeSubscriptionFilters: Get existing Lambda trigger configurations.
  3. Navigate to the Collect Logs tab in the AWS Integration tile.

  4. Select the AWS Account from where you want to collect logs, and enter the ARN of the Lambda created in the previous section.


  5. Check off the services from which you’d like to collect logs and hit Save. To stop collecting logs from a particular service, uncheck it.


  6. If you have logs across multiple regions, you must create additional Lambda functions in those regions and enter them in this tile.

  7. To stop collecting all AWS logs, press the x next to each Lambda ARN. All triggers for that function will be removed.

  8. Within a few minutes of this initial setup, you will see your AWS Logs appear in our logging platform in near real time.
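
For reference, the automatic trigger management amounts to calls like the following. This is a hedged boto3 sketch of how s3:PutBucketNotification and lambda:AddPermission combine to wire a bucket to the forwarder; the bucket and function names are placeholders:

import boto3

# Placeholders -- substitute your own values.
BUCKET = "my-elb-access-logs"
FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:datadog-forwarder"

# lambda:AddPermission -- let this bucket invoke the forwarder.
boto3.client("lambda").add_permission(
    FunctionName=FUNCTION_ARN,
    StatementId="datadog-s3-" + BUCKET,
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::" + BUCKET,
)

# s3:PutBucketNotification -- trigger the forwarder on every new object.
boto3.client("s3").put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": FUNCTION_ARN,
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)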

Note: For CloudFront distributions, you can store multiple distributions’ log files in the same bucket. When enabling logging, specify an optional prefix for the file names to keep track of which log files are associated with which distribution. For Datadog to tag CloudFront logs as source:cloudfront, include cloudfront somewhere in the prefix.

Manually set up triggers

In your Lambda function, go to the Triggers tab and select Add Trigger.

Select the log source and then follow the AWS instructions.

For instance, do not forget to set the correct event type (Object Created) on S3 buckets.
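
To confirm a manually created trigger took effect (and that the event type is Object Created), you can read the configuration back. A hedged boto3 sketch with a placeholder bucket name:

import boto3

# Placeholder -- substitute your own bucket.
config = boto3.client("s3").get_bucket_notification_configuration(
    Bucket="my-elb-access-logs",
)

# Each entry shows the target Lambda ARN and the event types that fire it;
# look for "s3:ObjectCreated:*" here.
for rule in config.get("LambdaFunctionConfigurations", []):
    print(rule["LambdaFunctionArn"], rule["Events"])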

Data Collected

Metrics

aws.advisor.service_limit.max
(gauge)
Maximum usage of AWS resources
shown as service
aws.advisor.service_limit.usage
(gauge)
Current usage of AWS resources
shown as service
aws.advisor.service_limit.usage_ratio
(gauge)
The percentage of resource utilization against a service limit.
shown as percent
aws.logs.incoming_bytes
(gauge)
The volume of log events in uncompressed bytes uploaded to Cloudwatch Logs.
shown as byte
aws.logs.incoming_log_events
(count)
The number of log events uploaded to Cloudwatch Logs.
shown as event
aws.logs.forwarded_bytes
(gauge)
The volume of log events in compressed bytes forwarded to the subscription destination.
shown as byte
aws.logs.forwarded_log_events
(count)
The number of log events forwarded to the subscription destination.
shown as event
aws.logs.delivery_errors
(count)
The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination.
shown as event
aws.logs.delivery_throttling
(count)
The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination.
shown as event
aws.ec2spot.available_instance_pools_count
(count)
The Spot Instance pools specified in the Spot Fleet request.
shown as instance
aws.ec2spot.bids_submitted_for_capacity
(count)
The capacity for which Amazon EC2 has submitted bids.
shown as instance
aws.ec2spot.eligible_instance_pool_count
(count)
The Spot Instance pools specified in the Spot Fleet request where Amazon EC2 can fulfill bids.
shown as instance
aws.ec2spot.fulfilled_capacity
(count)
The capacity that Amazon EC2 has fulfilled.
shown as instance
aws.ec2spot.max_percent_capacity_allocation
(gauge)
The maximum value of PercentCapacityAllocation across all Spot Instance pools specified in the Spot Fleet request.
shown as percent
aws.ec2spot.pending_capacity
(count)
The difference between TargetCapacity and FulfilledCapacity.
shown as instance
aws.ec2spot.percent_capacity_allocation
(gauge)
The capacity allocated for the Spot Instance pool for the specified dimensions.
shown as percent
aws.ec2spot.target_capacity
(count)
The target capacity of the Spot Fleet request.
shown as instance
aws.ec2spot.terminating_capacity
(count)
The capacity that is being terminated due to Spot Instance interruptions.
shown as instance
aws.ddosprotection.ddo_sattack_bits_per_second
(gauge)
The number of bits per second observed during a DDoS event for a particular Amazon Resource Name (ARN).
shown as bit
aws.ddosprotection.ddo_sattack_requests_per_second
(gauge)
The number of requests per second observed during a DDoS event for a particular Amazon Resource Name (ARN).
shown as request
aws.ddosprotection.ddo_sdetected
(gauge)
Indicates a DDoS event for a particular Amazon Resource Name (ARN).
aws.dms.cpuutilization
(gauge)
Average percentage of allocated EC2 compute units that are currently in use on the instance.
aws.dms.free_storage_space
(gauge)
The amount of available storage space
shown as byte
aws.dms.freeable_memory
(gauge)
The amount of available random access memory.
shown as byte
aws.dms.write_iops
(gauge)
The average number of disk I/O operations per second
shown as operation
aws.dms.read_iops
(gauge)
The average number of disk I/O operations per second.
shown as operation
aws.dms.write_throughput
(gauge)
The average number of bytes written to disk per second.
shown as byte
aws.dms.read_throughput
(gauge)
The average number of bytes read from disk per second.
shown as byte
aws.dms.write_latency
(gauge)
The average amount of time taken per write disk I/O operation
shown as second
aws.dms.read_latency
(gauge)
The average amount of time taken per read disk I/O operation
shown as second
aws.dms.swap_usage
(gauge)
The amount of swap space used on the DB Instance
shown as byte
aws.dms.network_transmit_throughput
(gauge)
The outgoing (Transmit) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication
shown as byte
aws.dms.network_receive_throughput
(gauge)
The incoming (Receive) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication.
shown as byte
aws.dms.full_load_throughput_bandwidth_source
(gauge)
Incoming network bandwidth from a full load from the source
shown as kibibyte
aws.dms.full_load_throughput_bandwidth_target
(gauge)
Outgoing network bandwidth from a full load for the target
shown as kibibyte
aws.dms.full_load_throughput_rows_source
(gauge)
Incoming changes from a full load from the source in rows per second
shown as row
aws.dms.full_load_throughput_rows_target
(gauge)
Outgoing changes from a full load for the target
shown as row
aws.dms.cdcincoming_changes
(gauge)
Total row count of changes for the task
shown as row
aws.dms.cdcchanges_memory_source
(gauge)
Amount of rows accumulating in a memory and waiting to be committed from the source
shown as row
aws.dms.cdcchanges_memory_target
(gauge)
Amount of rows accumulating in a memory and waiting to be committed to the target
shown as row
aws.dms.cdcchanges_disk_source
(gauge)
Amount of rows accumulating on disk and waiting to be committed from the source
shown as row
aws.dms.cdcchanges_disk_target
(gauge)
Amount of rows accumulating on disk and waiting to be committed to the target
shown as row
aws.dms.cdcthroughput_bandwidth_source
(gauge)
Incoming task network bandwidth from the source
shown as kibibyte
aws.dms.cdcthroughput_bandwidth_target
(gauge)
Outgoing task network bandwidth for the target
shown as kibibyte
aws.dms.cdcthroughput_rows_source
(gauge)
Incoming task changes from the source
shown as row
aws.dms.cdcthroughput_rows_target
(gauge)
Outgoing task changes for the target
shown as row
aws.dms.cdclatency_source
(gauge)
Latency reading from source
shown as second
aws.dms.cdclatency_target
(gauge)
Latency writing to the target
shown as second
aws.events.invocations
(count)
Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations but does not include throttled or retried attempts until they fail permanently.
aws.events.failed_invocations
(count)
Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt
aws.events.triggered_rules
(count)
Measures the number of triggered rules that matched with any event.
aws.events.matched_events
(count)
Measures the number of events that matched with any rule.
aws.events.throttled_rules
(count)
Measures the number of triggered rules that are being throttled.
aws.natgateway.active_connection_count
(count)
The count of concurrent active TCP connections through the NAT gateway.
shown as connection
aws.natgateway.bytes_in_from_destination
(count)
The number of bytes received by the NAT Gateway from the destination.
shown as byte
aws.natgateway.bytes_in_from_source
(count)
The number of bytes received by the NAT Gateway from the VPC clients.
shown as byte
aws.natgateway.bytes_out_to_destination
(count)
The number of bytes sent through the NAT Gateway to the destination.
shown as byte
aws.natgateway.bytes_out_to_source
(count)
The number of bytes sent through the NAT Gateway to the VPC clients.
shown as byte
aws.natgateway.connection_attempt_count
(count)
The count of connections attempted through the NAT Gateway.
shown as attempt
aws.natgateway.connection_established_count
(count)
The count of connections established through the NAT Gateway.
shown as connection
aws.natgateway.error_port_allocation
(count)
The count of times a source port could not be allocated by the NAT Gateway.
shown as error
aws.natgateway.idle_timeout_count
(count)
The count of timeouts caused by connections going from active to idle state.
shown as timeout
aws.natgateway.packets_drop_count
(count)
The count of packets dropped by the NAT Gateway.
shown as packet
aws.natgateway.packets_in_from_destination
(count)
The number of packets received by the NAT Gateway from the destination.
shown as packet
aws.natgateway.packets_in_from_source
(count)
The number of packets received by the NAT Gateway from the VPC clients.
shown as packet
aws.natgateway.packets_out_to_destination
(count)
The number of packets sent through the NAT Gateway to the destination.
shown as packet
aws.natgateway.packets_out_to_source
(count)
The number of packets sent through the NAT Gateway to the VPC clients.
shown as packet
aws.states.execution_time
(gauge)
The average time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.execution_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed.
shown as millisecond
aws.states.executions_aborted
(count)
The number of executions that were aborted/terminated.
aws.states.execution_throttled
(count)
The number of StateEntered events and retries that have been throttled.
aws.states.executions_failed
(count)
The number of executions that failed.
aws.states.executions_started
(count)
The number of executions started.
aws.states.executions_succeeded
(count)
The number of executions that completed successfully.
aws.states.executions_timed_out
(count)
The number of executions that timed out for any reason.
aws.states.lambda_function_run_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed.
shown as millisecond
aws.states.lambda_function_schedule_time
(gauge)
The avg time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.lambda_function_time
(gauge)
The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_function_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed.
shown as millisecond
aws.states.lambda_functions_failed
(count)
The number of lambda functions that failed.
aws.states.lambda_functions_heartbeat_timed_out
(count)
The number of lambda functions that were timed out due to a heartbeat timeout.
aws.states.lambda_functions_scheduled
(count)
The number of lambda functions that were scheduled.
aws.states.lambda_functions_started
(count)
The number of lambda functions that were started.
aws.states.lambda_functions_succeeded
(count)
The number of lambda functions that completed successfully.
aws.states.lambda_functions_timed_out
(count)
The number of lambda functions that were timed out on close.
aws.states.activity_run_time
(gauge)
The average time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_run_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed.
shown as millisecond
aws.states.activity_schedule_time
(gauge)
The avg time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.maximum
(gauge)
The maximum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.minimum
(gauge)
The minimum time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_schedule_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state.
shown as millisecond
aws.states.activity_time
(gauge)
The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.maximum
(gauge)
The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.minimum
(gauge)
The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.p95
(gauge)
The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activity_time.p99
(gauge)
The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed.
shown as millisecond
aws.states.activities_failed
(count)
The number of activities that failed.
aws.states.activities_heartbeat_timed_out
(count)
The number of activities that were timed out due to a heartbeat timeout.
aws.states.activities_scheduled
(count)
The number of activities that were scheduled.
aws.states.activities_started
(count)
The number of activities that were started.
aws.states.activities_succeeded
(count)
The number of activities that completed successfully.
aws.states.activities_timed_out
(count)
The number of activities that were timed out on close.
aws.trustedadvisor.green_checks
(gauge)
The number of Trusted Advisor checks in a green (OK) state.
shown as check
aws.trustedadvisor.yellow_checks
(gauge)
The number of Trusted Advisor checks in a yellow (WARN) state.
shown as check
aws.trustedadvisor.red_checks
(gauge)
The number of Trusted Advisor checks in a red (ERROR) state.
shown as check
aws.trustedadvisor.service_limit_usage
(gauge)
The percentage of resource utilization against a service limit.
shown as percent
aws.vpn.tunnel_data_in
(count)
The number of bytes that have come in through the VPN tunnel
shown as byte
aws.vpn.tunnel_data_out
(count)
The number of bytes that have gone out through the VPN tunnel
shown as byte
aws.vpn.tunnel_state
(gauge)
This metric is 1 when the VPN tunnel is up and 0 when it is down.
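
Once these metrics are flowing, they can be queried like any other Datadog metric. A hedged sketch using the datadogpy library; the API/app keys and the metric queried are placeholders:

import time
from datadog import initialize, api

# Placeholders -- substitute your own keys.
initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Query the last hour of an AWS metric collected by the integration.
now = int(time.time())
result = api.Metric.query(
    start=now - 3600,
    end=now,
    query="avg:aws.logs.incoming_bytes{*}",
)
for series in result.get("series", []):
    print(series["metric"], series["pointlist"][:3])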

Events

Events from AWS are collected on a per AWS-service basis. Please refer to the documentation of specific AWS services to learn more about the events collected.

Troubleshooting

Do you believe you’re seeing a discrepancy between your data in CloudWatch and Datadog?

There are two important distinctions to be aware of:

  1. In AWS, for counters, a graph set to ‘Sum’, ‘1 minute’ shows the total number of occurrences in the one minute leading up to that point, i.e. the rate per 1 minute. Datadog displays the raw data from AWS normalized to per-second values, regardless of the timeframe selected in AWS, which is why the value in Datadog will probably appear lower. For example, a counter that records 300 events in one minute graphs as 300 in CloudWatch but as 5 (events per second) in Datadog.
  2. Overall, min/max/avg have a different meaning within AWS than in Datadog. In AWS, average latency, minimum latency, and maximum latency are three distinct metrics that AWS collects. When Datadog pulls metrics from AWS CloudWatch, we only get the average latency as a single time series per ELB. Within Datadog, when you are selecting ‘min’, ‘max’, or ‘avg’, you are controlling how multiple time series will be combined. For example, requesting system.cpu.idle without any filter would return one series for each host that reports that metric and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.

Metrics delayed?

When using the AWS integration, we’re pulling in metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.

To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. Metrics are made available by AWS depending on the account level. For example, if you are paying for “detailed metrics” within AWS, they are available more quickly. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.

On the Datadog side, we do have the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact support@datadoghq.com for more info on this.

To obtain metrics with virtually zero delay, we recommend installing the Datadog Agent on those hosts. We’ve written a bit about this here, especially in relation to CloudWatch.

Missing metrics?

CloudWatch’s API returns only metrics with datapoints. So if, for instance, an ELB has no attached instances, you should not expect to see metrics related to this ELB in Datadog.

Wrong count of aws.elb.healthy_host_count?

When the cross-zone load balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch’s side), so if you have 2 instances in 1a and 3 in 1b, the metric displays 5 instances per availability zone. As this can be counterintuitive, we’ve added new metrics, aws.elb.healthy_host_count_deduped and aws.elb.un_healthy_host_count_deduped, that display the count of healthy and unhealthy instances per availability zone, regardless of whether the cross-zone load balancing option is enabled.

Duplicated hosts when installing the Agent?

When installing the Agent on an AWS host, you might see duplicated hosts on the infrastructure page for a few hours if you manually set the hostname in the Agent’s configuration. The second host disappears a few hours later and does not affect your billing.