Connect to Amazon Web Services (AWS) to:
Related integrations include:
API Gateway | create, publish, maintain, and secure APIs |
Autoscaling | scale EC2 capacity |
Billing | billing and budgets |
CloudFront | local content delivery network |
CloudTrail | access to log files and AWS API calls |
CloudSearch | managed search service |
Direct Connect | dedicated network connection to AWS |
DynamoDB | NoSQL Database |
EC2 Container Service (ECS) | container management service that supports Docker containers |
Elastic Beanstalk | easy-to-use service for deploying and scaling web applications and services |
Elastic Block Store (EBS) | persistent block level storage volumes |
ElastiCache | in-memory cache in the cloud |
Elastic Compute Cloud (EC2) | resizable compute capacity in the cloud |
Elastic File System (EFS) | shared file storage |
Elastic Load Balancing (ELB) | distributes incoming application traffic across multiple Amazon EC2 instances |
Elastic Map Reduce (EMR) | data processing using Hadoop |
Elasticsearch Service (ES) | deploy, operate, and scale Elasticsearch clusters |
Firehose | capture and load streaming data |
IoT | connect IoT devices with cloud services |
Kinesis | service for real-time processing of large, distributed data streams |
Key Management Service (KMS) | create and control encryption keys |
Lambda | serverless computing |
Machine Learning (ML) | create machine learning models |
OpsWorks | configuration management |
Polly | text-to-speech service |
Redshift | data warehouse solution |
Relational Database Service (RDS) | relational database in the cloud |
Route 53 | DNS and traffic management with availability monitoring |
Simple Email Service (SES) | cost-effective, outbound-only email-sending service |
Simple Notification Service (SNS) | alerts and notifications |
Simple Queue Service (SQS) | messaging queue service |
Simple Storage Service (S3) | highly available and scalable cloud storage service |
Simple Workflow Service (SWF) | cloud workflow management |
Storage Gateway | hybrid cloud storage |
Web Application Firewall (WAF) | protect web applications from common web exploits |
WorkSpaces | secure desktop computing service |
X-Ray | tracing for distributed applications |
Setting up the Datadog integration with Amazon Web Services requires configuring role delegation using AWS IAM. To get a better understanding of role delegation, refer to the AWS IAM Best Practices guide.
* Select Another AWS account for the Role Type.
* For the Account ID, enter 464622532012 (Datadog’s account ID). This means that you are granting Datadog read-only access to your AWS data.
* Check Require external ID and enter the one generated in the Datadog app. Make sure you leave Require MFA disabled. For more information about the External ID, refer to this document in the IAM User Guide.
* Click Next: Permissions.
* Click Create Policy, which opens in a new window.
* Select the JSON tab. To take advantage of every AWS integration offered by Datadog, use the policy snippet below in the textbox. As other components are added to an integration, these permissions may change.
* Click Review policy.
* Name the policy DatadogAWSIntegrationPolicy or one of your own choosing, and provide an apt description.
* Click Create policy. You can now close this window.
* Click Next: Review.
* Give the role a name such as DatadogAWSIntegrationRole, as well as an apt description. Click Create Role.

Bonus: If you use Terraform, set up your Datadog IAM policy using The AWS Integration with Terraform.
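For readers who script their IAM setup, the role delegation described in the steps above corresponds to a trust policy attached to the role. The sketch below builds such a document in Python 3. The function name build_trust_policy and the placeholder external ID are illustrative assumptions, not part of Datadog's tooling, and the document the AWS console generates may differ in detail.

```python
import json

# Datadog's account ID, as given in the setup steps.
DATADOG_ACCOUNT_ID = "464622532012"


def build_trust_policy(external_id: str) -> dict:
    """Return a trust policy that lets Datadog's account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{DATADOG_ACCOUNT_ID}:root"},
                "Action": "sts:AssumeRole",
                # Must match the external ID generated in the Datadog app.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    # "EXAMPLE_EXTERNAL_ID" is a placeholder, not a real value.
    print(json.dumps(build_trust_policy("EXAMPLE_EXTERNAL_ID"), indent=2))
```

Keeping the external ID as a parameter avoids hard-coding a value that is specific to each Datadog account.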
The permissions listed below are included in the Policy Document using wildcards such as List* and Get*. If you require strict policies, use the complete action names as listed and reference the Amazon API documentation for the services you require.
If you are not comfortable granting all permissions, at the very least use the existing policies named AmazonEC2ReadOnlyAccess and CloudWatchReadOnlyAccess. For more detailed information regarding permissions, see the Core Permissions tab.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "apigateway:GET",
        "autoscaling:Describe*",
        "budgets:ViewBudget",
        "cloudfront:GetDistributionConfig",
        "cloudfront:ListDistributions",
        "cloudtrail:DescribeTrails",
        "cloudtrail:GetTrailStatus",
        "cloudwatch:Describe*",
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "codedeploy:List*",
        "codedeploy:BatchGet*",
        "directconnect:Describe*",
        "dynamodb:List*",
        "dynamodb:Describe*",
        "ec2:Describe*",
        "ecs:Describe*",
        "ecs:List*",
        "elasticache:Describe*",
        "elasticache:List*",
        "elasticfilesystem:DescribeFileSystems",
        "elasticfilesystem:DescribeTags",
        "elasticloadbalancing:Describe*",
        "elasticmapreduce:List*",
        "elasticmapreduce:Describe*",
        "es:ListTags",
        "es:ListDomainNames",
        "es:DescribeElasticsearchDomains",
        "health:DescribeEvents",
        "health:DescribeEventDetails",
        "health:DescribeAffectedEntities",
        "kinesis:List*",
        "kinesis:Describe*",
        "lambda:AddPermission",
        "lambda:GetPolicy",
        "lambda:List*",
        "lambda:RemovePermission",
        "logs:Get*",
        "logs:Describe*",
        "logs:FilterLogEvents",
        "logs:TestMetricFilter",
        "logs:PutSubscriptionFilter",
        "logs:DeleteSubscriptionFilter",
        "logs:DescribeSubscriptionFilters",
        "rds:Describe*",
        "rds:List*",
        "redshift:DescribeClusters",
        "redshift:DescribeLoggingStatus",
        "route53:List*",
        "s3:GetBucketLogging",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:GetBucketTagging",
        "s3:ListAllMyBuckets",
        "s3:PutBucketNotification",
        "ses:Get*",
        "sns:List*",
        "sns:Publish",
        "sqs:ListQueues",
        "support:*",
        "tag:GetResources",
        "tag:GetTagKeys",
        "tag:GetTagValues",
        "xray:BatchGetTraces",
        "xray:GetTraceSummaries"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
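To see what a wildcard entry such as cloudwatch:Get* covers before replacing it with complete action names, a small matcher can help. The following is a minimal sketch, assuming shell-style matching from Python's fnmatch module is a close enough stand-in for the * wildcard in IAM actions. The helper names are hypothetical, and this does not evaluate a real IAM policy.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, action: str) -> bool:
    """True if a pattern such as 'cloudwatch:Get*' covers the action name."""
    # IAM action names are case-insensitive, so compare lowercased strings.
    return fnmatchcase(action.lower(), pattern.lower())


def covered_actions(patterns, actions):
    """Return the full action names covered by at least one pattern."""
    return [a for a in actions if any(matches(p, a) for p in patterns)]


if __name__ == "__main__":
    patterns = ["cloudwatch:Get*", "cloudwatch:List*"]
    actions = [
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics",
        "cloudwatch:PutMetricData",
    ]
    # PutMetricData is not covered by either pattern.
    print(covered_actions(patterns, actions))
```

A strict policy would list only the full names this returns for the actions you actually need.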
The core Datadog AWS integration pulls data from AWS CloudWatch. At a minimum, your Policy Document needs to allow the following actions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:Get*",
        "cloudwatch:List*",
        "ec2:Describe*",
        "support:*",
        "tag:GetResources",
        "tag:GetTagKeys",
        "tag:GetTagValues"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
AWS Permission | Description |
---|---|
cloudwatch:ListMetrics | List the available CloudWatch metrics. |
cloudwatch:GetMetricData | Fetch data points for a given metric. |
support:* | Add metrics about service limits. It requires full access because of AWS limitations. |
tag:GetResources | Get custom tags by resource type. |
tag:GetTagKeys | Get tag keys by region within an AWS account. |
tag:GetTagValues | Get tag values by region within an AWS account. |
The main use of the Resource Group Tagging API is to reduce the number of API calls needed to collect custom tags. For more information, review the Tag policies documentation on the AWS website.
Enter your AWS Account ID without dashes, for example 123456789012, not 1234-5678-9012. Your Account ID can be found in the ARN of the role created during the installation of the AWS integration. Then enter the name of the created role. Note: The role name you enter in the Integration Tile is case sensitive and must exactly match the role name created on the AWS side.

AWS service logs are collected via the Datadog Lambda function. This Lambda, which triggers on S3 buckets, CloudWatch Log Groups, and CloudWatch Events, forwards logs to Datadog.
To start collecting logs from your AWS services:
Configure the triggers that cause the lambda to execute. There are two ways to configure the triggers:
To add the Datadog log-forwarder Lambda to your AWS account, you can either use the AWS Serverless Repository or manually create a new Lambda.
Use the AWS Serverless Repository to deploy the Lambda in your AWS account.
Navigate to the Lambda console and create a new function:
Select Author from scratch and give the function a unique name.
Change the Runtime to Python 2.7.
For Role, select Create new role from template(s) and give the role a unique name.
If you are pulling logs from an S3 bucket, under Policy templates search for and select s3 object read-only permissions.
Select Create Function.
At the top of the script you’ll find a section called #Parameters. You have two options for providing the API Key that the Lambda function requires:
Scroll down beyond the inline code area to Basic Settings.
Set the memory to around 1GB.
Set the timeout limit (120 seconds recommended).
Scroll back to the top of the page and hit Save.
Any AWS service that generates logs into an S3 bucket or a CloudWatch Log Group is supported. Find specific setup instructions for the most used services in the table below:
AWS service | Activate AWS service logging | Send AWS logs to Datadog |
---|---|---|
API Gateway | Enable AWS API Gateway logs | Manual log collection |
CloudFront | Enable AWS CloudFront logs | Manual and automatic log collection |
CloudTrail | Enable AWS CloudTrail logs | Manual log collection |
DynamoDB | Enable AWS DynamoDB logs | Manual log collection |
EC2 | - | Use the Datadog Agent to send your logs to Datadog |
ECS | - | Use the Docker Agent to gather your logs |
Elastic Load Balancing (ELB) | Enable AWS ELB logs | Manual and automatic log collection |
Lambda | - | Manual and automatic log collection |
RDS | Enable AWS RDS logs | Manual log collection |
Route 53 | Enable AWS Route 53 logs | Manual log collection |
S3 | Enable AWS S3 logs | Manual and automatic log collection |
SNS | SNS has no logs of its own. Process the logs and events that transit through the SNS service. | Manual log collection |
Redshift | Enable AWS Redshift logs | Manual and automatic log collection |
VPC | Enable AWS VPC logs | Manual log collection |
There are two options when configuring triggers on the Datadog Lambda function:
* Manually set up triggers on S3 buckets, CloudWatch Log Groups, or CloudWatch Events.
* Let Datadog automatically set and manage the list of triggers.

If you are storing logs in many S3 buckets or CloudWatch Log Groups, Datadog can automatically manage triggers for you.
Add the required permissions to your Datadog role in the IAM Console. You may already have some of these permissions from other Datadog-AWS integrations. Information on how these permissions are used can be found in the descriptions below:
"cloudfront:GetDistributionConfig",
"cloudfront:ListDistributions",
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeLoadBalancerAttributes",
"lambda:AddPermission",
"lambda:GetPolicy",
"lambda:RemovePermission",
"redshift:DescribeClusters",
"redshift:DescribeLoggingStatus",
"s3:GetBucketLogging",
"s3:GetBucketLocation",
"s3:GetBucketNotification",
"s3:ListAllMyBuckets",
"s3:PutBucketNotification",
"logs:PutSubscriptionFilter",
"logs:DeleteSubscriptionFilter",
"logs:DescribeSubscriptionFilters"
AWS Permission | Description |
---|---|
cloudfront:GetDistributionConfig | Get the name of the S3 bucket containing CloudFront access logs. |
cloudfront:ListDistributions | List all CloudFront distributions. |
elasticloadbalancing:DescribeLoadBalancers | List all load balancers. |
elasticloadbalancing:DescribeLoadBalancerAttributes | Get the name of the S3 bucket containing ELB access logs. |
lambda:AddPermission | Add permission allowing a particular S3 bucket to trigger a Lambda function. |
lambda:GetPolicy | Get the Lambda policy when triggers are to be removed. |
lambda:RemovePermission | Remove permissions from a Lambda policy. |
redshift:DescribeClusters | List all Redshift clusters. |
redshift:DescribeLoggingStatus | Get the name of the S3 bucket containing Redshift logs. |
s3:GetBucketLogging | Get the name of the S3 bucket containing S3 access logs. |
s3:GetBucketLocation | Get the region of the S3 bucket containing S3 access logs. |
s3:GetBucketNotification | Get existing Lambda trigger configurations. |
s3:ListAllMyBuckets | List all S3 buckets. |
s3:PutBucketNotification | Add or remove a Lambda trigger based on S3 bucket events. |
logs:PutSubscriptionFilter | Add a Lambda trigger based on CloudWatch Log events. |
logs:DeleteSubscriptionFilter | Remove a Lambda trigger based on CloudWatch Log events. |
logs:DescribeSubscriptionFilters | List the subscription filters for the specified log group. |
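Before enabling automatic trigger management, it can be useful to check which of the permissions listed above a role is still missing. The sketch below is a simple set difference over exact action names. The function name is hypothetical, the granted list is assumed to be something you have already extracted from your own policy, and wildcard entries are deliberately not expanded.

```python
# Permissions listed above for automatic trigger management.
REQUIRED_FOR_AUTO_TRIGGERS = {
    "cloudfront:GetDistributionConfig",
    "cloudfront:ListDistributions",
    "elasticloadbalancing:DescribeLoadBalancers",
    "elasticloadbalancing:DescribeLoadBalancerAttributes",
    "lambda:AddPermission",
    "lambda:GetPolicy",
    "lambda:RemovePermission",
    "redshift:DescribeClusters",
    "redshift:DescribeLoggingStatus",
    "s3:GetBucketLogging",
    "s3:GetBucketLocation",
    "s3:GetBucketNotification",
    "s3:ListAllMyBuckets",
    "s3:PutBucketNotification",
    "logs:PutSubscriptionFilter",
    "logs:DeleteSubscriptionFilter",
    "logs:DescribeSubscriptionFilters",
}


def missing_permissions(granted):
    """Return required actions absent from `granted`, sorted by name.

    Only exact names are compared; a wildcard such as "s3:Get*" in
    `granted` is not expanded by this sketch.
    """
    return sorted(REQUIRED_FOR_AUTO_TRIGGERS - set(granted))


if __name__ == "__main__":
    granted = REQUIRED_FOR_AUTO_TRIGGERS - {"s3:PutBucketNotification"}
    print(missing_permissions(granted))
```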
If you are storing logs in a CloudWatch Log Group, send them to Datadog as follows:
Select the corresponding CloudWatch Log Group, add a filter name (but feel free to leave the filter empty) and add the trigger:
Once done, go into your Datadog Log section to start exploring your logs!
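For context on what the Lambda receives from a CloudWatch Log Group trigger: CloudWatch Logs delivers subscription data as a base64-encoded, gzip-compressed JSON payload under awslogs.data. The following is a minimal sketch of decoding that payload in Python 3. It is not the Datadog forwarder's source, and the function name is an assumption made for illustration.

```python
import base64
import gzip
import json


def decode_cloudwatch_logs_event(event: dict) -> list:
    """Return the log messages carried by a CloudWatch Logs subscription event."""
    # The payload is base64-encoded and gzip-compressed JSON.
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return [entry["message"] for entry in payload.get("logEvents", [])]


if __name__ == "__main__":
    # Build a synthetic event to exercise the decoder locally.
    raw = json.dumps(
        {"logGroup": "example-group", "logEvents": [{"message": "hello"}]}
    ).encode("utf-8")
    event = {"awslogs": {"data": base64.b64encode(gzip.compress(raw)).decode("ascii")}}
    print(decode_cloudwatch_logs_event(event))
```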
If you are storing logs in an S3 bucket, send them to Datadog as follows:
Once the Lambda function is installed, manually add a trigger on the S3 bucket that contains your logs in the AWS console:
Select the bucket and then follow the AWS instructions:
Set the correct event type on S3 buckets:
Once done, go into your Datadog Log section to start exploring your logs!
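When an S3 trigger fires, the Lambda receives an S3 event notification listing the affected bucket and object key for each record. The sketch below extracts those pairs in Python 3. It illustrates the notification format only, not the Datadog forwarder's code, and the function name is hypothetical.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event: dict) -> list:
    """Return (bucket, key) pairs for each record in an S3 notification event."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


if __name__ == "__main__":
    # Synthetic event with a hypothetical bucket and key.
    event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-logs"}, "object": {"key": "elb/my+file.log"}}}
        ]
    }
    print(s3_objects_from_event(event))
```

Decoding the key matters for object names that contain spaces or other characters that are escaped in the notification.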
aws.advisor.service_limit.max (gauge) |
Max usage of AWS resources shown as service |
aws.advisor.service_limit.usage (gauge) |
Current usage of AWS resources shown as service |
aws.advisor.service_limit.usage_ratio (gauge) |
The percentage of resource utilization against a service limit. shown as percent |
aws.appsync.latency (gauge) |
The average time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn't include the network latency encountered for a response to reach the end devices. shown as millisecond |
aws.appsync.latency.p90 (gauge) |
The 90th percentile time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn't include the network latency encountered for a response to reach the end devices. shown as millisecond |
aws.appsync.latency.maximum (gauge) |
The maximum time between when AWS AppSync receives a request from a client and when it returns a response to the client. This doesn't include the network latency encountered for a response to reach the end devices. shown as millisecond |
aws.appsync.4xxerror (count) |
The number of errors captured as a result of invalid requests due to incorrect client configuration. shown as error |
aws.appsync.5xxerror (count) |
Errors encountered during the execution of a GraphQL query. shown as error |
aws.logs.incoming_bytes (gauge) |
The volume of log events in uncompressed bytes uploaded to Cloudwatch Logs. shown as byte |
aws.logs.incoming_log_events (count) |
The number of log events uploaded to Cloudwatch Logs. shown as event |
aws.logs.forwarded_bytes (gauge) |
The volume of log events in compressed bytes forwarded to the subscription destination. shown as byte |
aws.logs.forwarded_log_events (count) |
The number of log events forwarded to the subscription destination. shown as event |
aws.logs.delivery_errors (count) |
The number of log events for which CloudWatch Logs received an error when forwarding data to the subscription destination. shown as event |
aws.logs.delivery_throttling (count) |
The number of log events for which CloudWatch Logs was throttled when forwarding data to the subscription destination. shown as event |
aws.ec2spot.available_instance_pools_count (count) |
The Spot Instance pools specified in the Spot Fleet request. shown as instance |
aws.ec2spot.bids_submitted_for_capacity (count) |
The capacity for which Amazon EC2 has submitted bids. shown as instance |
aws.ec2spot.eligible_instance_pool_count (count) |
The Spot Instance pools specified in the Spot Fleet request where Amazon EC2 can fulfill bids. shown as instance |
aws.ec2spot.fulfilled_capacity (count) |
The capacity that Amazon EC2 has fulfilled. shown as instance |
aws.ec2spot.max_percent_capacity_allocation (gauge) |
The maximum value of PercentCapacityAllocation across all Spot Instance pools specified in the Spot Fleet request. shown as percent |
aws.ec2spot.pending_capacity (count) |
The difference between TargetCapacity and FulfilledCapacity. shown as instance |
aws.ec2spot.percent_capacity_allocation (gauge) |
The capacity allocated for the Spot Instance pool for the specified dimensions. shown as percent |
aws.ec2spot.target_capacity (count) |
The target capacity of the Spot Fleet request. shown as instance |
aws.ec2spot.terminating_capacity (count) |
The capacity that is being terminated due to Spot Instance interruptions. shown as instance |
aws.ddosprotection.ddo_sattack_bits_per_second (gauge) |
The number of bytes observed during a DDoS event for a particular Amazon Resource Name (ARN). shown as byte |
aws.ddosprotection.ddo_sattack_requests_per_second (gauge) |
The number of requests observed during a DDoS event for a particular Amazon Resource Name (ARN). shown as request |
aws.ddosprotection.ddo_sdetected (gauge) |
Indicates a DDoS event for a particular Amazon Resource Name (ARN). |
aws.dms.cpuutilization (gauge) |
Average percentage of allocated EC2 compute units that are currently in use on the instance. |
aws.dms.free_storage_space (gauge) |
The amount of available storage space shown as byte |
aws.dms.freeable_memory (gauge) |
The amount of available random access memory. shown as byte |
aws.dms.write_iops (gauge) |
The average number of disk I/O operations per second shown as operation |
aws.dms.read_iops (gauge) |
The average number of disk I/O operations per second. shown as operation |
aws.dms.write_throughput (gauge) |
The average number of bytes written to disk per second. shown as byte |
aws.dms.read_throughput (gauge) |
The average number of bytes read from disk per second. shown as byte |
aws.dms.write_latency (gauge) |
The average amount of time taken per write disk I/O operation shown as second |
aws.dms.read_latency (gauge) |
The average amount of time taken per read disk I/O operation shown as second |
aws.dms.swap_usage (gauge) |
The amount of swap space used on the DB Instance shown as byte |
aws.dms.network_transmit_throughput (gauge) |
The outgoing (Transmit) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication shown as byte |
aws.dms.network_receive_throughput (gauge) |
The incoming (Receive) network traffic on the DB instance including both customer database traffic and Amazon RDS traffic used for monitoring and replication. shown as byte |
aws.dms.full_load_throughput_bandwidth_source (gauge) |
Incoming network bandwidth from a full load from the source shown as kibibyte |
aws.dms.full_load_throughput_bandwidth_target (gauge) |
Outgoing network bandwidth from a full load for the target shown as kibibyte |
aws.dms.full_load_throughput_rows_source (gauge) |
Incoming changes from a full load from the source in rows per second shown as row |
aws.dms.full_load_throughput_rows_target (gauge) |
Outgoing changes from a full load for the target shown as row |
aws.dms.cdcincoming_changes (gauge) |
Total row count of changes for the task shown as row |
aws.dms.cdcchanges_memory_source (gauge) |
Amount of rows accumulating in a memory and waiting to be committed from the source shown as row |
aws.dms.cdcchanges_memory_target (gauge) |
Amount of rows accumulating in a memory and waiting to be committed to the target shown as row |
aws.dms.cdcchanges_disk_source (gauge) |
Amount of rows accumulating on disk and waiting to be committed from the source shown as row |
aws.dms.cdcchanges_disk_target (gauge) |
Amount of rows accumulating on disk and waiting to be committed to the target shown as row |
aws.dms.cdcthroughput_bandwidth_source (gauge) |
Incoming task network bandwidth from the source shown as kibibyte |
aws.dms.cdcthroughput_bandwidth_target (gauge) |
Outgoing task network bandwidth for the target shown as kibibyte |
aws.dms.cdcthroughput_rows_source (gauge) |
Incoming task changes from the source shown as row |
aws.dms.cdcthroughput_rows_target (gauge) |
Outgoing task changes for the target shown as row |
aws.dms.cdclatency_source (gauge) |
Latency reading from source shown as second |
aws.dms.cdclatency_target (gauge) |
Latency writing to the target shown as second |
aws.events.invocations (count) |
Measures the number of times a target is invoked for a rule in response to an event. This includes successful and failed invocations but does not include throttled or retried attempts until they fail permanently. |
aws.events.failed_invocations (count) |
Measures the number of invocations that failed permanently. This does not include invocations that are retried or that succeeded after a retry attempt |
aws.events.triggered_rules (count) |
Measures the number of triggered rules that matched with any event. |
aws.events.matched_events (count) |
Measures the number of events that matched with any rule. |
aws.events.throttled_rules (count) |
Measures the number of triggered rules that are being throttled. |
aws.natgateway.active_connection_count (count) |
The count of concurrent active TCP connections through the NAT gateway. shown as connection |
aws.natgateway.bytes_in_from_destination (count) |
The number of bytes received by the NAT Gateway from the destination. shown as byte |
aws.natgateway.bytes_in_from_source (count) |
The number of bytes received by the NAT Gateway from the VPC clients. shown as byte |
aws.natgateway.bytes_out_to_destination (count) |
The number of bytes sent through the NAT Gateway to the destination. shown as byte |
aws.natgateway.bytes_out_to_source (count) |
The number of bytes sent through the NAT Gateway to the VPC clients. shown as byte |
aws.natgateway.connection_attempt_count (count) |
The count of connections attempted through the NAT Gateway. shown as attempt |
aws.natgateway.connection_established_count (count) |
The count of connections established through the NAT Gateway. shown as connection |
aws.natgateway.error_port_allocation (count) |
The count of times a source port could not be allocated by the NAT Gateway. shown as error |
aws.natgateway.idle_timeout_count (count) |
The count of timeouts caused by connections going from active to idle state. shown as timeout |
aws.natgateway.packets_drop_count (count) |
The count of packets dropped by the NAT Gateway. shown as packet |
aws.natgateway.packets_in_from_destination (count) |
The number of packets received by the NAT Gateway from the destination. shown as packet |
aws.natgateway.packets_in_from_source (count) |
The number of packets received by the NAT Gateway from the VPC clients. shown as packet |
aws.natgateway.packets_out_to_destination (count) |
The number of packets sent through the NAT Gateway to the destination. shown as packet |
aws.natgateway.packets_out_to_source (count) |
The number of packets sent through the NAT Gateway to the VPC clients. shown as packet |
aws.states.execution_time (gauge) |
The average time interval, in milliseconds, between the time the execution started and the time it closed. shown as millisecond |
aws.states.execution_time.maximum (gauge) |
The maximum time interval, in milliseconds, between the time the execution started and the time it closed. shown as millisecond |
aws.states.execution_time.minimum (gauge) |
The minimum time interval, in milliseconds, between the time the execution started and the time it closed. shown as millisecond |
aws.states.execution_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, between the time the execution started and the time it closed. shown as millisecond |
aws.states.execution_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, between the time the execution started and the time it closed. shown as millisecond |
aws.states.executions_aborted (count) |
The number of executions that were aborted/terminated. |
aws.states.execution_throttled (count) |
The number of StateEntered events and retries that have been throttled. |
aws.states.executions_failed (count) |
The number of executions that failed. |
aws.states.executions_started (count) |
The number of executions started. |
aws.states.executions_succeeded (count) |
The number of executions that completed successfully. |
aws.states.executions_timed_out (count) |
The number of executions that timed out for any reason. |
aws.states.lambda_function_run_time (gauge) |
The average time interval, in milliseconds, between the time the lambda function was started and when it was closed. shown as millisecond |
aws.states.lambda_function_run_time.maximum (gauge) |
The maximum time interval, in milliseconds, between the time the lambda function was started and when it was closed. shown as millisecond |
aws.states.lambda_function_run_time.minimum (gauge) |
The minimum time interval, in milliseconds, between the time the lambda function was started and when it was closed. shown as millisecond |
aws.states.lambda_function_run_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed. shown as millisecond |
aws.states.lambda_function_run_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, between the time the lambda function was started and when it was closed. shown as millisecond |
aws.states.lambda_function_schedule_time (gauge) |
The average time interval, in milliseconds, that the lambda function stayed in the schedule state. shown as millisecond |
aws.states.lambda_function_schedule_time.maximum (gauge) |
The maximum time interval, in milliseconds, that the lambda function stayed in the schedule state. shown as millisecond |
aws.states.lambda_function_schedule_time.minimum (gauge) |
The minimum time interval, in milliseconds, that the lambda function stayed in the schedule state. shown as millisecond |
aws.states.lambda_function_schedule_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state. shown as millisecond |
aws.states.lambda_function_schedule_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, that the lambda function stayed in the schedule state. shown as millisecond |
aws.states.lambda_function_time (gauge) |
The average time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. shown as millisecond |
aws.states.lambda_function_time.maximum (gauge) |
The maximum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. shown as millisecond |
aws.states.lambda_function_time.minimum (gauge) |
The minimum time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. shown as millisecond |
aws.states.lambda_function_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. shown as millisecond |
aws.states.lambda_function_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, between the time the lambda function was scheduled and when it was closed. shown as millisecond |
aws.states.lambda_functions_failed (count) |
The number of lambda functions that failed. |
aws.states.lambda_functions_heartbeat_timed_out (count) |
The number of lambda functions that were timed out due to a heartbeat timeout. |
aws.states.lambda_functions_scheduled (count) |
The number of lambda functions that were scheduled. |
aws.states.lambda_functions_started (count) |
The number of lambda functions that were started. |
aws.states.lambda_functions_succeeded (count) |
The number of lambda functions that completed successfully. |
aws.states.lambda_functions_timed_out (count) |
The number of lambda functions that were timed out on close. |
aws.states.activity_run_time (gauge) |
The average time interval, in milliseconds, between the time the activity was started and when it was closed. shown as millisecond |
aws.states.activity_run_time.maximum (gauge) |
The maximum time interval, in milliseconds, between the time the activity was started and when it was closed. shown as millisecond |
aws.states.activity_run_time.minimum (gauge) |
The minimum time interval, in milliseconds, between the time the activity was started and when it was closed. shown as millisecond |
aws.states.activity_run_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, between the time the activity was started and when it was closed. shown as millisecond |
aws.states.activity_run_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, between the time the activity was started and when it was closed. shown as millisecond |
aws.states.activity_schedule_time (gauge) |
The average time interval, in milliseconds, that the activity stayed in the schedule state. shown as millisecond |
aws.states.activity_schedule_time.maximum (gauge) |
The maximum time interval, in milliseconds, that the activity stayed in the schedule state. shown as millisecond |
aws.states.activity_schedule_time.minimum (gauge) |
The minimum time interval, in milliseconds, that the activity stayed in the schedule state. shown as millisecond |
aws.states.activity_schedule_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, that the activity stayed in the schedule state. shown as millisecond |
aws.states.activity_schedule_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, that the activity stayed in the schedule state. shown as millisecond |
aws.states.activity_time (gauge) |
The average time interval, in milliseconds, between the time the activity was scheduled and when it was closed. shown as millisecond |
aws.states.activity_time.maximum (gauge) |
The maximum time interval, in milliseconds, between the time the activity was scheduled and when it was closed. shown as millisecond |
aws.states.activity_time.minimum (gauge) |
The minimum time interval, in milliseconds, between the time the activity was scheduled and when it was closed. shown as millisecond |
aws.states.activity_time.p95 (gauge) |
The 95th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed. shown as millisecond |
aws.states.activity_time.p99 (gauge) |
The 99th percentile time interval, in milliseconds, between the time the activity was scheduled and when it was closed. shown as millisecond |
aws.states.activities_failed (count) |
The number of activities that failed. |
aws.states.activities_heartbeat_timed_out (count) |
The number of activities that were timed out due to a heartbeat timeout. |
aws.states.activities_scheduled (count) |
The number of activities that were scheduled. |
aws.states.activities_started (count) |
The number of activities that were started. |
aws.states.activities_succeeded (count) |
The number of activities that completed successfully. |
aws.states.activities_timed_out (count) |
The number of activities that were timed out on close. |
aws.trustedadvisor.green_checks (gauge) |
The number of Trusted Advisor checks in a green (OK) state. shown as check |
aws.trustedadvisor.yellow_checks (gauge) |
The number of Trusted Advisor checks in a yellow (WARN) state. shown as check |
aws.trustedadvisor.red_checks (gauge) |
The number of Trusted Advisor checks in a red (ERROR) state. shown as check |
aws.trustedadvisor.service_limit_usage (gauge) |
The percentage of resource utilization against a service limit. shown as percent |
aws.vpn.tunnel_data_in (count) |
The number of bytes that have come in through the VPN tunnel shown as byte |
aws.vpn.tunnel_data_out (count) |
The number of bytes that have gone out through the VPN tunnel shown as byte |
aws.vpn.tunnel_state (gauge) |
This metric is 1 when the VPN tunnel is up and 0 when it is down |
Events from AWS are collected on a per AWS-service basis. Please refer to the documentation of specific AWS services to learn more about the events collected.
There are two important distinctions to be aware of:
A request for system.cpu.idle without any filter would return one series for each host that reports that metric, and those series need to be combined to be graphed. On the other hand, if you requested system.cpu.idle from a single host, no aggregation would be necessary and switching between average and max would yield the same result.

When using the AWS integration, Datadog pulls in your metrics via the CloudWatch API. You may see a slight delay in metrics from AWS due to some constraints that exist for their API.
To begin, the CloudWatch API only offers a metric-by-metric crawl to pull data. The CloudWatch APIs have a rate limit that varies based on the combination of authentication credentials, region, and service. How quickly AWS makes metrics available depends on the account level. For example, if you are paying for “detailed metrics” within AWS, they are available more quickly. This level of service for detailed metrics also applies to granularity, with some metrics being available per minute and others per five minutes.
Datadog has the ability to prioritize certain metrics within an account to pull them in faster, depending on the circumstances. Please contact Datadog support for more info.
To obtain metrics with virtually zero delay, install the Datadog Agent on the host. For more information, see Datadog’s blog post Don’t fear the Agent: Agent-based monitoring.
CloudWatch’s API returns only metrics with data points, so if for instance an ELB has no attached instances, it is expected not to see metrics related to this ELB in Datadog.
When the cross-zone load balancing option is enabled on an ELB, all the instances attached to this ELB are considered part of all availability zones (on CloudWatch’s side), so if you have 2 instances in 1a and 3 in 1b, the metric displays 5 instances per availability zone. As this can be counterintuitive, we’ve added new metrics, aws.elb.healthy_host_count_deduped and aws.elb.un_healthy_host_count_deduped, that display the count of healthy and unhealthy instances per availability zone, regardless of whether the cross-zone load balancing option is enabled.
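The arithmetic in the cross-zone example above can be sketched as follows. This illustrates the two ways of counting only; it is not how Datadog or CloudWatch compute the metrics, and the function names are hypothetical.

```python
from collections import Counter


def cross_zone_counts(instance_zones):
    """Counts as shown when every instance is attributed to every zone."""
    zones = set(instance_zones)
    return {zone: len(instance_zones) for zone in zones}


def deduped_counts(instance_zones):
    """Counts with each instance attributed only to its own zone."""
    return dict(Counter(instance_zones))


if __name__ == "__main__":
    # 2 instances in zone 1a and 3 in zone 1b, as in the example above.
    instance_zones = ["1a", "1a", "1b", "1b", "1b"]
    print(cross_zone_counts(instance_zones))  # 5 instances in each zone
    print(deduped_counts(instance_zones))     # 2 in 1a, 3 in 1b
```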
When installing the Agent on an AWS host, you might see duplicated hosts on the infra page for a few hours if you manually set the hostname in the Agent’s configuration. This second host disappears a few hours later, and won’t affect your billing.