Datadog automatically generates suggestions to resolve errors and performance problems and to optimize cost for your serverless applications.

In addition to the insights provided by Watchdog for Serverless, Datadog Serverless Monitoring detects a number of issues regarding your functions and creates warnings.

Setup

Datadog uses AWS CloudWatch metrics, Datadog enhanced AWS Lambda metrics, and Lambda REPORT log lines to generate warnings. To set these up:

  1. Set up the Amazon Web Services integration.
  2. Set up the Datadog Forwarder and ensure your Lambda REPORT logs are indexed in Datadog.
  3. Enable Enhanced Lambda Metrics for your functions.

Note: Datadog generates High Errors, High Duration, Throttled, and High Iterator Age warnings out of the box after setting up the AWS integration. All other warnings, including those generated on individual invocations, require the Datadog Forwarder and Enhanced Lambda Metrics.
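
To make the role of REPORT log lines concrete, the following sketch extracts the duration and memory fields that several of the warnings below depend on. It assumes the common tab-separated REPORT layout emitted by AWS Lambda; the names REPORT_PATTERN and parse_report_line are hypothetical and are not part of any Datadog or AWS library.

```python
import re
from typing import Optional

# Pattern for the fields of a Lambda REPORT line (assumed layout).
# "Init Duration" is only present on cold-start invocations, so it is optional.
REPORT_PATTERN = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<max_memory_mb>\d+) MB"
    r"(?:\s+Init Duration: (?P<init_ms>[\d.]+) ms)?"
)


def parse_report_line(line: str) -> Optional[dict]:
    """Return the numeric fields of a REPORT line, or None if it does not match."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "request_id": fields["request_id"],
        "duration_ms": float(fields["duration_ms"]),
        "billed_ms": float(fields["billed_ms"]),
        "memory_mb": int(fields["memory_mb"]),
        "max_memory_mb": int(fields["max_memory_mb"]),
        # None when the line carries no Init Duration field.
        "init_ms": float(fields["init_ms"]) if fields["init_ms"] else None,
    }


# Illustrative sample line; the values are made up.
sample = (
    "REPORT RequestId: 3604209a-e9a3-11e6-939a-754dd98c7be3\t"
    "Duration: 12.34 ms\tBilled Duration: 13 ms\t"
    "Memory Size: 128 MB\tMax Memory Used: 122 MB\tInit Duration: 150.25 ms"
)
report = parse_report_line(sample)
```

Lines that are not REPORT lines (for example START or END lines) return None, so the function can be applied to a whole log stream.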

Generated warnings

Errors

More than 1% of the function’s invocations were errors in the selected time range.

Resolution: Examine the function’s logs, check for recent code or configuration changes with Deployment Tracking, or look for failures across microservices with distributed tracing.

High errors

More than 10% of the function’s invocations were errors in the selected time range.

Resolution: Examine the function’s logs, check for recent code or configuration changes with Deployment Tracking, or look for failures across microservices with distributed tracing.
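
The two error warnings differ only in their threshold. As a rough illustration, and not Datadog's actual implementation, the sketch below maps invocation and error counts for a time range to the warning level described above. The function name classify_error_rate and its return labels are hypothetical.

```python
def classify_error_rate(invocations: int, errors: int) -> str:
    """Map error counts for a time range to the warning levels described above."""
    if invocations <= 0:
        # No traffic in the selected range, so there is nothing to warn about.
        return "none"
    rate = errors / invocations
    if rate > 0.10:   # more than 10% of invocations were errors
        return "high errors"
    if rate > 0.01:   # more than 1% of invocations were errors
        return "errors"
    return "none"
```

For example, 50 errors in 1,000 invocations (5%) falls in the "errors" band, while 150 errors (15%) falls in "high errors".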

High memory usage

At least one invocation in the selected time range used over 95% of the allocated memory.

Distributed tracing can help you pinpoint Lambda functions with low memory limits and parts of your application using excessive amounts of memory.

Resolution: Lambda functions using close to their maximum configured memory are at risk of being killed by the Lambda runtime, resulting in user-facing errors. Consider increasing the amount of configured memory on your function. Note that this could affect your AWS bill.

High duration

At least one invocation in the selected time range exceeded 95% of the configured timeout.

Distributed tracing can help you pinpoint slow API calls in your application.

Resolution: Lambda functions running for close to their configured timeout are at risk of being killed by the Lambda runtime. This could lead to slow or failed responses to incoming requests. Consider increasing the configured timeout if you expect your function to need more execution time. Note that this could affect your AWS bill.
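
To see which invocations would trip this warning, you can compare each duration against 95% of the configured timeout. The sketch below assumes you already have per-invocation durations in milliseconds (for example, from REPORT log lines) keyed by request ID; near_timeout_invocations is a hypothetical helper name.

```python
from typing import Dict, List


def near_timeout_invocations(
    durations_ms: Dict[str, float],
    timeout_s: float,
    threshold: float = 0.95,
) -> List[str]:
    """Return request IDs whose duration exceeded `threshold` of the timeout."""
    # Convert the configured timeout (seconds) to the same unit as the durations.
    limit_ms = threshold * timeout_s * 1000.0
    return [rid for rid, duration in durations_ms.items() if duration > limit_ms]
```

With a 3-second timeout, an invocation that ran for 2,900 ms is flagged, while one that ran for 1,200 ms is not.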

Cold starts

More than 1% of the function’s invocations were cold starts in the selected time range.

Datadog’s enhanced metrics and distributed tracing can help you understand the impact of cold starts on your applications today.

Resolution: Cold starts occur when your serverless applications receive sudden increases in traffic, and can occur when the function was previously inactive or when it was receiving a relatively constant number of requests. Users may perceive cold starts as slow response times or lag. To get ahead of cold starts, consider enabling provisioned concurrency on your impacted Lambda functions. Note that this could affect your AWS bill.
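
As a minimal sketch of the 1% threshold, the code below computes a cold start rate from a list of per-invocation init durations. It assumes that an invocation is a cold start exactly when its REPORT line carries an Init Duration value (represented here as a number, with None for warm invocations). The function names are hypothetical.

```python
from typing import Optional, Sequence


def cold_start_rate(init_durations_ms: Sequence[Optional[float]]) -> float:
    """Fraction of invocations that reported an init duration (assumed cold starts)."""
    if not init_durations_ms:
        return 0.0
    cold = sum(1 for init in init_durations_ms if init is not None)
    return cold / len(init_durations_ms)


def has_cold_start_warning(
    init_durations_ms: Sequence[Optional[float]],
    threshold: float = 0.01,
) -> bool:
    """True when more than `threshold` of invocations were cold starts."""
    return cold_start_rate(init_durations_ms) > threshold
```

Two cold starts in 100 invocations (2%) would exceed the threshold; one cold start in 201 invocations would not.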

Out of memory

At least one invocation in the selected time range ran out of memory.

Resolution: Lambda functions that use more than their allotted amount of memory can be killed by the Lambda runtime. To users, this may look like failed requests to your application. Distributed tracing can help you pinpoint parts of your application using excessive amounts of memory. Consider increasing the amount of memory your Lambda function is allowed to use.

Timeouts

At least one invocation in the selected time range timed out. This occurs when your function runs for longer than the configured timeout or the global Lambda timeout.

Resolution: Distributed tracing can help you pinpoint slow requests to APIs and other microservices. You can also consider increasing the timeout of your function. Note that this could affect your AWS bill.

Throttles

More than 10% of invocations in the selected time range were throttled. Throttling occurs when your serverless Lambda applications receive high levels of traffic without adequate concurrency.

Resolution: Check your Lambda concurrency metrics and confirm if aws.lambda.concurrent_executions.maximum is approaching your AWS account concurrency level. If so, consider configuring reserved concurrency, or request a service quota increase from AWS. Note that this may affect your AWS bill.
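
The concurrency check above can be sketched as a simple ratio. This example assumes you have already read the maximum of aws.lambda.concurrent_executions.maximum for the time range and know your account concurrency limit. The 90% cutoff for "approaching" is an illustrative assumption, not a documented Datadog or AWS threshold, and approaching_concurrency_limit is a hypothetical name.

```python
def approaching_concurrency_limit(
    max_concurrent_executions: float,
    account_limit: int,
    warn_ratio: float = 0.9,  # assumed cutoff for "approaching"; tune to taste
) -> bool:
    """True when peak concurrency reached `warn_ratio` of the account limit."""
    if account_limit <= 0:
        raise ValueError("account_limit must be positive")
    return max_concurrent_executions / account_limit >= warn_ratio
```

For an account limit of 1,000, a peak of 950 concurrent executions would be treated as approaching the limit, while a peak of 400 would not.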

High iterator age

The function’s iterator was older than two hours. Iterator age measures the age of the last record for each batch of records processed from a stream. When this value increases, it means your function cannot process data fast enough.

Resolution: Enable distributed tracing to isolate why your function has so much data being streamed to it. You can also consider increasing the shard count and batch size of the stream your function reads from.
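
For reference, the two-hour threshold can be expressed against an iterator age value as follows. This sketch assumes the value is reported in milliseconds, as the CloudWatch IteratorAge metric is; iterator_too_old is a hypothetical helper name.

```python
from datetime import timedelta

# Threshold from the warning description above.
ITERATOR_AGE_LIMIT = timedelta(hours=2)


def iterator_too_old(iterator_age_ms: float) -> bool:
    """True when the iterator age (assumed milliseconds) is older than two hours."""
    return timedelta(milliseconds=iterator_age_ms) > ITERATOR_AGE_LIMIT
```

An iterator age of 7,500,000 ms (a little over two hours) exceeds the limit; an age of 60,000 ms (one minute) does not.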

Over provisioned

No invocation in the selected time range used more than 10% of the allocated memory. This means your function has more billable resources allocated to it than it may need.

Resolution: Consider decreasing the amount of allocated memory on your Lambda function. Note that this may affect your AWS bill.
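
The High memory usage and Over provisioned warnings look at the same ratio from opposite ends. The sketch below, which is illustrative rather than Datadog's implementation, classifies a function from the maximum memory used by each invocation in the time range. The function name classify_memory and its return labels are hypothetical.

```python
from typing import Iterable


def classify_memory(max_used_mb: Iterable[float], allocated_mb: float) -> str:
    """Classify memory sizing from per-invocation peak memory usage."""
    ratios = [used / allocated_mb for used in max_used_mb]
    if not ratios:
        return "none"
    if any(ratio > 0.95 for ratio in ratios):
        # At least one invocation used over 95% of the allocated memory.
        return "high memory usage"
    if all(ratio <= 0.10 for ratio in ratios):
        # No invocation used more than 10% of the allocated memory.
        return "over provisioned"
    return "none"
```

With 128 MB allocated, a single invocation peaking at 122 MB yields "high memory usage", while invocations that all peak around 10 to 12 MB yield "over provisioned".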

Threats detected

Attack attempts were detected targeting the serverless application.

Resolution: Investigate the attack attempts in Application Security Management (ASM) by clicking the Security Signals button to determine how to respond. If immediate action is needed, you can block the attacking IP in your WAF through the Workflows integration.

Under provisioned

CPU utilization for this function averaged more than 80%. This means your function may see increased performance from additional CPU resources.

Resolution: Consider increasing the amount of allocated memory on your Lambda function. Increasing the amount of memory scales available CPU resources. Note that this may affect your AWS bill.

Overallocated provisioned concurrency

The function’s provisioned concurrency utilization was below 60%. According to AWS, provisioned concurrency is best optimized for cost when utilization is consistently greater than 60%.

Resolution: Consider decreasing the amount of configured provisioned concurrency for your function.
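
One way to reason about a lower setting is to work backward from the 60% figure. The sketch below is a hypothetical heuristic, not an AWS or Datadog recommendation: given the peak number of provisioned environments in use, it computes the utilization and the configured value at which that same peak would equal the target utilization. Both function names are assumptions for illustration.

```python
def utilization_pct(peak_in_use: int, configured: int) -> float:
    """Provisioned concurrency utilization as a percentage."""
    return 100.0 * peak_in_use / configured


def right_sized_concurrency(peak_in_use: int, target_pct: int = 60) -> int:
    """Configured value at which `peak_in_use` equals `target_pct` utilization.

    Uses integer ceiling division to avoid floating-point rounding surprises.
    """
    return max(1, -(-peak_in_use * 100 // target_pct))
```

For example, a function configured with 100 provisioned environments that peaks at 30 in use is at 30% utilization, and right_sized_concurrency(30) returns 50. Leave extra headroom if your traffic is bursty.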

Deprecated runtime

The function’s runtime is no longer supported.

Resolution: Upgrade to the latest runtime to ensure you are up to date on the latest security, performance, and reliability standards.

Reaching maximum duration

At least one invocation in the selected time range approached the maximum duration limit of 15 minutes.

Distributed tracing can help you pinpoint slow API calls in your application.

Resolution: Lambda functions approaching the maximum timeout limit of 15 minutes risk termination by the Lambda runtime. This could lead to slow or failed responses to incoming requests. Consider improving the performance of your Lambda function, splitting the work into smaller Lambda functions in a Step Functions workflow, or moving your workload to a longer-running environment like ECS Fargate.

Recursive invocations dropped

Invocations of this function are caught in a recursive loop, generally caused by recursive triggering between AWS entities (for example, Lambda -> SQS -> Lambda). Invocations beyond your maxReceiveCount (default 16) are dropped and counted in this metric. For more information, see Use Lambda recursive loop detection to prevent infinite loops.

Resolution: Find recursive calls in your AWS entities related to this function. Look for related entities such as SQS, SNS, and S3.
