
Prefect

Supported OS: Linux, Windows, macOS

Integration version: 1.0.0

To find out if this integration is available in your organization, see your Datadog Integrations page or ask your organization administrator.

To initiate an exception request to enable this integration for your organization, email support@ddog-gov.com.

Overview

This check monitors Prefect Server through the Datadog Agent.

Prefect is a Python-first workflow orchestration platform used to schedule and execute flows and tasks across work pools, work queues, and workers. This integration collects orchestration health metrics, performance metrics, and events directly from the Prefect Server API. It also supports log collection.

What this integration monitors

The integration collects metrics across multiple layers of the Prefect orchestration hierarchy:

  • Server health: API readiness and health status to confirm the control plane is operational.
  • Work pool layer: Pool readiness, paused or not-ready state, and aggregated worker availability to detect capacity or configuration issues.
  • Worker layer: Online or offline status and heartbeat age to identify lost or unhealthy workers.
  • Work queue layer: Backlog size, backlog age, last polled age, concurrency utilization, and queue state (ready, paused, or not-ready) to detect congestion, starvation, and stalled consumers.
  • Deployment and flow layer: Flow run counts by state (running, completed, failed, crashed, etc.), throughput, late starts, execution duration, queue wait time, and retry gaps to track reliability and latency percentiles.
  • Task layer: Task run counts by state, throughput, execution duration, and dependency wait time to enable drilldowns from slow flows to individual task bottlenecks.
  • Events: Prefect events for state transitions and lifecycle changes.
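
To make the server health layer concrete, the following is a minimal sketch of how an API probe could be mapped to a 1/0 gauge such as prefect.server.health or prefect.server.ready. It is not the integration's implementation. The base URL http://localhost:4200/api, the health and ready endpoint paths, and the helper names http_status and probe are assumptions made for illustration; confirm the endpoints against your Prefect Server version.

```python
from typing import Callable
from urllib.error import HTTPError
from urllib.request import urlopen

# Assumption: a local Prefect Server exposing its API at this base URL.
API_BASE = "http://localhost:4200/api"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar errors.
        return 0


def probe(endpoint: str, fetch: Callable[[str], int] = http_status) -> int:
    """Map an endpoint's HTTP status to a gauge value: 1 for 2xx, else 0."""
    status = fetch(f"{API_BASE}/{endpoint}")
    return 1 if 200 <= status < 300 else 0
```

Calling probe("health") and probe("ready") against a running server would return the two gauge values. The fetch parameter exists only so the mapping can be exercised without a network connection.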

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Prefect check is included in the Datadog Agent package. No additional installation is needed on your server.

Configuration

  1. To start collecting your Prefect performance data, edit the prefect.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample prefect.d/conf.yaml for all available configuration options.

  2. Restart the Agent.

Validation

Run the Agent’s status subcommand and look for prefect under the Checks section.
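
If you want to script this validation step, the following sketch runs the status subcommand and searches its output for the check name. The helper names check_is_listed and agent_status are hypothetical, and the matching rule (a line that starts with the check name) is an assumption about the status output layout rather than a documented format. Depending on your platform, the command may need elevated privileges.

```python
import subprocess


def check_is_listed(status_output: str, check_name: str = "prefect") -> bool:
    """Return True if any line of the status output starts with the check name."""
    return any(
        line.strip().startswith(check_name)
        for line in status_output.splitlines()
    )


def agent_status() -> str:
    """Run `datadog-agent status` and return its standard output."""
    result = subprocess.run(
        ["datadog-agent", "status"],
        capture_output=True,
        text=True,
        check=False,  # Inspect the output even if the Agent exits non-zero.
    )
    return result.stdout
```

For example, check_is_listed(agent_status()) returns True when a line for the prefect check appears in the status output.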

Data Collected

Metrics

prefect.server.deployment.is_ready
(gauge)
Indicates whether the deployment is ready. Submits 1 when ready and 0 when not.
prefect.server.flow_runs.cancelled.count
(count)
Number of flow runs that were cancelled in the collection interval.
Shown as run
prefect.server.flow_runs.completed.count
(count)
Number of flow runs that completed successfully in the collection interval.
Shown as run
prefect.server.flow_runs.crashed.count
(count)
Number of flow runs that crashed unexpectedly in the collection interval.
Shown as run
prefect.server.flow_runs.execution_duration.95percentile
(gauge)
95th percentile execution time of flow runs.
Shown as second
prefect.server.flow_runs.execution_duration.avg
(gauge)
Average execution time of flow runs.
Shown as second
prefect.server.flow_runs.execution_duration.count
(gauge)
Count of flow run execution duration samples.
Shown as run
prefect.server.flow_runs.execution_duration.max
(gauge)
Maximum execution time of flow runs.
Shown as second
prefect.server.flow_runs.execution_duration.median
(gauge)
Median execution time of flow runs.
Shown as second
prefect.server.flow_runs.failed.count
(count)
Number of flow runs that failed in the collection interval.
Shown as run
prefect.server.flow_runs.late_start.count
(count)
Number of flow runs that started later than their scheduled time.
Shown as run
prefect.server.flow_runs.paused
(gauge)
Number of flow runs currently paused.
Shown as run
prefect.server.flow_runs.pending
(gauge)
Number of flow runs currently in the pending state.
Shown as run
prefect.server.flow_runs.queue_wait_duration.95percentile
(gauge)
95th percentile time a flow run spent waiting in the queue before starting.
Shown as second
prefect.server.flow_runs.queue_wait_duration.avg
(gauge)
Average time a flow run spent waiting in the queue before starting.
Shown as second
prefect.server.flow_runs.queue_wait_duration.count
(gauge)
Count of flow run queue wait duration samples.
Shown as run
prefect.server.flow_runs.queue_wait_duration.max
(gauge)
Maximum time a flow run spent waiting in the queue before starting.
Shown as second
prefect.server.flow_runs.queue_wait_duration.median
(gauge)
Median time a flow run spent waiting in the queue before starting.
Shown as second
prefect.server.flow_runs.retry_gaps_duration.95percentile
(gauge)
95th percentile time gap between consecutive retries of the same flow run.
Shown as second
prefect.server.flow_runs.retry_gaps_duration.avg
(gauge)
Average time gap between consecutive retries of the same flow run.
Shown as second
prefect.server.flow_runs.retry_gaps_duration.count
(gauge)
Count of flow run retry gap samples.
Shown as occurrence
prefect.server.flow_runs.retry_gaps_duration.max
(gauge)
Maximum time gap between consecutive retries of the same flow run.
Shown as second
prefect.server.flow_runs.retry_gaps_duration.median
(gauge)
Median time gap between consecutive retries of the same flow run.
Shown as second
prefect.server.flow_runs.running
(gauge)
Number of flow runs currently in the running state.
Shown as run
prefect.server.flow_runs.scheduled
(gauge)
Number of flow runs currently in the scheduled state.
Shown as run
prefect.server.flow_runs.throughput
(count)
Count of flow runs started per second.
Shown as run
prefect.server.health
(gauge)
Indicates that the Prefect API is responding to requests. Submits 1 when healthy and 0 when not.
prefect.server.ready
(gauge)
Indicates that the Prefect API is able to accept and process work. Submits 1 when ready and 0 when not.
prefect.server.task_runs.cancelled.count
(count)
Number of task runs that were cancelled in the collection interval.
Shown as run
prefect.server.task_runs.completed.count
(count)
Number of task runs that completed successfully in the collection interval.
Shown as run
prefect.server.task_runs.crashed.count
(count)
Number of task runs that crashed unexpectedly in the collection interval.
Shown as run
prefect.server.task_runs.dependency_wait_duration.95percentile
(gauge)
95th percentile time a task run waited after its latest upstream dependency completed.
Shown as second
prefect.server.task_runs.dependency_wait_duration.avg
(gauge)
Average time a task run waited after its latest upstream dependency completed.
Shown as second
prefect.server.task_runs.dependency_wait_duration.count
(gauge)
Count of task run dependency wait duration samples.
Shown as occurrence
prefect.server.task_runs.dependency_wait_duration.max
(gauge)
Maximum time a task run waited after its latest upstream dependency completed.
Shown as second
prefect.server.task_runs.dependency_wait_duration.median
(gauge)
Median time a task run waited after its latest upstream dependency completed.
Shown as second
prefect.server.task_runs.execution_duration.95percentile
(gauge)
95th percentile execution time of individual task runs.
Shown as second
prefect.server.task_runs.execution_duration.avg
(gauge)
Average execution time of individual task runs.
Shown as second
prefect.server.task_runs.execution_duration.count
(gauge)
Count of task run execution duration samples.
Shown as run
prefect.server.task_runs.execution_duration.max
(gauge)
Maximum execution time of individual task runs.
Shown as second
prefect.server.task_runs.execution_duration.median
(gauge)
Median execution time of individual task runs.
Shown as second
prefect.server.task_runs.failed.count
(count)
Number of task runs that failed in the collection interval.
Shown as run
prefect.server.task_runs.late_start.count
(count)
Number of task runs that started later than their scheduled time.
Shown as run
prefect.server.task_runs.paused
(gauge)
Number of task runs currently paused.
Shown as run
prefect.server.task_runs.pending
(gauge)
Number of task runs currently in the pending state.
Shown as run
prefect.server.task_runs.running
(gauge)
Number of task runs currently in the running state.
Shown as run
prefect.server.task_runs.throughput
(count)
Count of task runs started per second.
Shown as run
prefect.server.work_pool.is_not_ready
(gauge)
Whether the work pool is not ready to accept and dispatch flow runs. Submits 1 when true and 0 when false.
prefect.server.work_pool.is_paused
(gauge)
Whether the work pool is paused. Submits 1 when true and 0 when false.
prefect.server.work_pool.is_ready
(gauge)
Whether the work pool is ready to accept and dispatch flow runs. Submits 1 when true and 0 when false.
prefect.server.work_pool.worker.heartbeat_age_seconds
(gauge)
Time since the worker last sent a heartbeat.
Shown as second
prefect.server.work_pool.worker.is_online
(gauge)
Whether the worker is online. Submits 1 when true and 0 when false.
prefect.server.work_queue.backlog.age
(gauge)
Age of the oldest item in the queue backlog.
Shown as second
prefect.server.work_queue.backlog.size
(gauge)
Number of flow runs waiting in the queue backlog.
Shown as run
prefect.server.work_queue.concurrency.in_use
(gauge)
Percentage of concurrency in use by the queue.
Shown as percent
prefect.server.work_queue.is_not_ready
(gauge)
Whether the work queue is not ready to accept and dispatch flow runs. Submits 1 when true and 0 when false.
prefect.server.work_queue.is_paused
(gauge)
Whether the work queue is paused. Submits 1 when true and 0 when false.
prefect.server.work_queue.is_ready
(gauge)
Whether the work queue is ready to accept and dispatch flow runs. Submits 1 when true and 0 when false.
prefect.server.work_queue.last_polled_age_seconds
(gauge)
Time elapsed since any worker last polled the queue.
Shown as second

Logs

  1. Enable log collection in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit the logs configuration block in your prefect.d/conf.yaml file. For example:

    logs:
      - type: docker
        source: prefect
        service: <SERVICE>
    

Events

The Prefect integration includes event support. Events are disabled by default; to enable them, set collect_events to true in the configuration.

After you enable event collection, the integration submits flow-run, task-run, and ready or not-ready events. To customize which events are submitted, add or remove entries in the configuration.

Service Checks

The Prefect integration does not include any service checks.

Uninstallation

To disable the integration, rename the configuration file from prefect.d/conf.yaml to prefect.d/conf.yaml.example, then restart the Agent. Alternatively, if you are running a containerized environment, you can remove the annotation used to enable the integration.

Support

Need help? Contact Datadog Support.