Cilium

Agent Check

Supported OS: Linux, Mac OS, Windows

Overview

This check monitors Cilium through the Datadog Agent. The integration can collect metrics from either the cilium-agent or the cilium-operator.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

The Cilium check is included in the Datadog Agent package, but it requires additional setup steps to expose Prometheus metrics.

  1. To enable Prometheus metrics in both the cilium-agent and the cilium-operator, deploy Cilium with the Helm value global.prometheus.enabled=true, or:

  2. Enable Prometheus metrics separately for each component:

    • In the cilium-agent, add --prometheus-serve-addr=:9090 to the args section of the Cilium DaemonSet config:

      # [...]
      spec:
        containers:
          - args:
              - --prometheus-serve-addr=:9090

    • Or, in the cilium-operator, add --enable-metrics to the args section of the Cilium Deployment config:

      # [...]
      spec:
        containers:
          - args:
              - --enable-metrics
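
As a rough illustration of option 1, the dotted Helm value global.prometheus.enabled=true corresponds to the nested YAML below. The file name values.yaml is an assumption; the same value can also be passed on the Helm command line instead of through a file.

```yaml
# values.yaml (hypothetical file name) passed to the Cilium Helm chart.
# This is the nested form of the dotted key global.prometheus.enabled=true.
global:
  prometheus:
    enabled: true
```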

Configuration

Host

  1. To start collecting your Cilium performance data, edit the cilium.d/conf.yaml file in the conf.d/ folder at the root of your Agent’s configuration directory. See the sample cilium.d/conf.yaml for all available configuration options.

    • To collect cilium-agent metrics, enable the agent_endpoint option.
    • To collect cilium-operator metrics, enable the operator_endpoint option.
  2. Restart the Agent.
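
A minimal sketch of what cilium.d/conf.yaml might look like for step 1. The localhost host name and the operator port 6942 are assumptions made for illustration; only the option names agent_endpoint and operator_endpoint and the agent port 9090 come from this page. Check the sample cilium.d/conf.yaml for the authoritative values.

```yaml
# cilium.d/conf.yaml (sketch; adjust hosts and ports to your deployment)
init_config:

instances:
    # Collect cilium-agent metrics. Port 9090 matches the
    # --prometheus-serve-addr=:9090 argument shown in the Installation section.
  - agent_endpoint: http://localhost:9090/metrics

    # To collect cilium-operator metrics instead, use operator_endpoint.
    # The port below is an assumption; use the port your operator exposes.
    # - operator_endpoint: http://localhost:6942/metrics
```

Each instance in this sketch targets a single endpoint, in line with the Overview, which says the integration collects from either the cilium-agent or the cilium-operator.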

Log Collection

Cilium produces two types of logs: cilium-agent and cilium-operator.

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your DaemonSet configuration:

     # (...)
       env:
       #  (...)
         - name: DD_LOGS_ENABLED
           value: "true"
         - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
           value: "true"
     # (...)
  2. Mount the Docker socket to the Datadog Agent, as done in this manifest, or mount the /var/log/pods directory if you are not using Docker.

  3. Restart the Agent.
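
For step 2 without Docker, mounting /var/log/pods into the Agent could look roughly like the fragment below. This is a sketch built from standard Kubernetes hostPath volumes; the container name datadog-agent and the volume name pod-logs are hypothetical and are not taken from the referenced manifest.

```yaml
# Fragment of the Datadog Agent DaemonSet pod spec (sketch).
spec:
  containers:
    - name: datadog-agent          # assumed container name
      volumeMounts:
        - name: pod-logs           # hypothetical volume name
          mountPath: /var/log/pods
          readOnly: true           # the Agent only needs to read log files
  volumes:
    - name: pod-logs
      hostPath:
        path: /var/log/pods        # pod log directory on the node
```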

Containerized

For containerized environments, see the Autodiscovery Integration Templates for guidance on applying the parameters below.

Metric collection

Parameter            Value
<INTEGRATION_NAME>   cilium
<INIT_CONFIG>        blank or {}
<INSTANCE_CONFIG>    {"agent_endpoint": "http://%%host%%:9090/metrics"}

Log collection

Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes log collection documentation.

Parameter      Value
<LOG_CONFIG>   {"source": "cilium-agent", "service": "cilium-agent"}
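
One common way to apply these parameters in Kubernetes is through Autodiscovery pod annotations. The sketch below assumes the Cilium container is named cilium-agent (the identifier after ad.datadoghq.com/ must match your actual container name) and uses a hypothetical pod name. In practice the annotations would normally go on the pod template of the Cilium DaemonSet.

```yaml
# Pod metadata fragment (sketch); parameter values are taken from the tables above.
apiVersion: v1
kind: Pod
metadata:
  name: cilium-example             # hypothetical name
  annotations:
    # <INTEGRATION_NAME>
    ad.datadoghq.com/cilium-agent.check_names: '["cilium"]'
    # <INIT_CONFIG>
    ad.datadoghq.com/cilium-agent.init_configs: '[{}]'
    # <INSTANCE_CONFIG>
    ad.datadoghq.com/cilium-agent.instances: '[{"agent_endpoint": "http://%%host%%:9090/metrics"}]'
    # <LOG_CONFIG>
    ad.datadoghq.com/cilium-agent.logs: '[{"source": "cilium-agent", "service": "cilium-agent"}]'
spec:
  containers:
    - name: cilium-agent           # assumed container name
      # [...]
```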

Validation

Run the Agent’s status subcommand and look for cilium under the Checks section.

Data Collected

Metrics

cilium.agent.api_process_time.seconds.count
(count)
Count of processing time for all API calls
Shown as request
cilium.agent.api_process_time.seconds.sum
(gauge)
Sum of processing time for all API calls
Shown as second
cilium.agent.bootstrap.seconds.count
(count)
Count of bootstrap durations
cilium.agent.bootstrap.seconds.sum
(gauge)
Sum of bootstrap durations
Shown as second
cilium.bpf.map_ops.total
(count)
Total BPF map operations performed
Shown as operation
cilium.controllers.failing.count
(count)
Number of failing controllers
Shown as error
cilium.controllers.runs_duration.seconds.count
(count)
Count of controller processes duration
Shown as operation
cilium.controllers.runs_duration.seconds.sum
(gauge)
Sum of controller processes duration
Shown as second
cilium.controllers.runs.total
(count)
Total number of controller runs
Shown as event
cilium.datapath.conntrack_gc.duration.seconds.count
(count)
Count of garbage collector process duration
Shown as operation
cilium.datapath.conntrack_gc.duration.seconds.sum
(gauge)
Sum of garbage collector process duration
Shown as second
cilium.datapath.conntrack_gc.entries
(gauge)
The number of alive and deleted conntrack entries
Shown as garbage collection
cilium.datapath.conntrack_gc.key_fallbacks.total
(count)
Number of times a key fallback was needed when iterating over the BPF map
Shown as garbage collection
cilium.datapath.conntrack_gc.runs.total
(count)
Total number of the conntrack garbage collector process runs
Shown as garbage collection
cilium.datapath.errors.total
(count)
Total number of errors in datapath management
Shown as error
cilium.drop_bytes.total
(count)
Total dropped bytes
Shown as byte
cilium.drop_count.total
(count)
Total dropped packets
Shown as packet
cilium.endpoint.count
(count)
Total ready endpoints managed by agent
Shown as unit
cilium.endpoint.regeneration_time_stats.seconds.count
(count)
Count of endpoint regeneration time stats
Shown as operation
cilium.endpoint.regeneration_time_stats.seconds.sum
(gauge)
Sum of endpoint regeneration time stats
Shown as second
cilium.endpoint.regenerations.count
(count)
Count of completed endpoint regenerations
Shown as unit
cilium.endpoint.state
(gauge)
Count of all endpoints
Shown as unit
cilium.errors_warning.total
(count)
Total error warnings
Shown as error
cilium.event_timestamp
(gauge)
Last timestamp of event received
Shown as time
cilium.forward_bytes.total
(count)
Total forwarded bytes
Shown as byte
cilium.forward_count.total
(count)
Total forwarded packets
Shown as packet
cilium.fqdn.gc_deletions.total
(count)
Total number of FQDNs cleaned in FQDN garbage collector job
Shown as event
cilium.identity.count
(gauge)
Number of identities allocated
Shown as unit
cilium.ip_addresses.count
(gauge)
Number of allocated IP addresses
Shown as unit
cilium.ipam.events.total
(count)
Number of IPAM events received by action and datapath family type
Shown as event
cilium.k8s_client.api_calls.count
(count)
Number of API calls made to kube-apiserver
Shown as request
cilium.k8s_client.api_latency_time.seconds.count
(count)
Count of processed API call duration
Shown as request
cilium.k8s_client.api_latency_time.seconds.sum
(gauge)
Sum of processed API call duration
Shown as second
cilium.kubernetes.events_received.total
(count)
Number of Kubernetes received events processed
Shown as event
cilium.kubernetes.events.total
(count)
Number of Kubernetes events processed
Shown as event
cilium.nodes.all_datapath_validations.total
(count)
Number of validation calls to implement the datapath implementation of a node
Shown as unit
cilium.nodes.all_events_received.total
(count)
Number of node events received
Shown as event
cilium.nodes.managed.total
(gauge)
Number of nodes managed
Shown as node
cilium.policy.count
(gauge)
Number of policies currently loaded
Shown as unit
cilium.policy.endpoint_enforcement_status
(gauge)
Number of endpoints labeled by policy enforcement status
Shown as unit
cilium.policy.import_errors.count
(count)
Number of failed policy imports
Shown as error
cilium.policy.l7_denied.total
(count)
Number of total L7 denied requests/responses due to policy
Shown as unit
cilium.policy.l7_forwarded.total
(count)
Number of total L7 forwarded requests/responses
Shown as unit
cilium.policy.l7_parse_errors.total
(count)
Number of total L7 parse errors
Shown as error
cilium.policy.l7_received.total
(count)
Number of total L7 received requests/responses
Shown as unit
cilium.policy.max_revision
(gauge)
Highest policy revision number in the agent
Shown as unit
cilium.policy.regeneration_time_stats.seconds.count
(count)
Policy regeneration time stats count
Shown as operation
cilium.policy.regeneration_time_stats.seconds.sum
(gauge)
Policy regeneration time stats sum
Shown as second
cilium.policy.regeneration.total
(count)
Total number of successful policy regenerations
Shown as unit
cilium.process.cpu.seconds.total
(gauge)
Process CPU time in seconds
Shown as second
cilium.process.max_fds
(gauge)
Process file descriptor maximum
Shown as file
cilium.process.open_fds
(gauge)
Number of open file descriptors
Shown as file
cilium.process.resident_memory.bytes
(gauge)
Total resident memory bytes
Shown as byte
cilium.process.start_time.seconds
(gauge)
Process start time
Shown as second
cilium.process.virtual_memory.bytes
(gauge)
Virtual memory bytes
Shown as byte
cilium.process.virtual_memory.max.bytes
(gauge)
Maximum virtual memory bytes
Shown as byte
cilium.subprocess.start.total
(count)
Number of times that Cilium has started a subprocess
Shown as unit
cilium.triggers_policy.update_call_duration.seconds.count
(count)
Count of policy update trigger duration
Shown as operation
cilium.triggers_policy.update_call_duration.seconds.sum
(gauge)
Sum of policy update trigger duration
Shown as second
cilium.triggers_policy.update_folds
(gauge)
Number of folds
Shown as unit
cilium.triggers_policy.update.total
(count)
Total number of policy update trigger invocations
Shown as unit
cilium.unreachable.health_endpoints
(gauge)
Number of health endpoints that cannot be reached
Shown as unit
cilium.unreachable.nodes
(gauge)
Number of nodes that cannot be reached
Shown as node
cilium.operator.process.cpu.seconds
(count)
Total user and system CPU time spent in seconds
Shown as second
cilium.operator.process.max_fds
(gauge)
Maximum number of open file descriptors
Shown as file
cilium.operator.process.open_fds
(gauge)
Number of open file descriptors
Shown as file
cilium.operator.process.resident_memory.bytes
(gauge)
Resident memory size in bytes
Shown as byte
cilium.operator.process.start_time.second
(gauge)
Start time of the process since unix epoch in seconds
Shown as second
cilium.operator.process.virtual_memory.bytes
(gauge)
Virtual memory size in bytes
Shown as byte
cilium.operator.process.virtual_memory_max.bytes
(gauge)
Maximum amount of virtual memory available in bytes
Shown as byte
cilium.kvstore.operations_duration.seconds.count
(count)
Duration of kvstore operation count
Shown as operation
cilium.kvstore.operations_duration.seconds.sum
(gauge)
Duration of kvstore operation sum
Shown as second
cilium.kvstore.events_queue.seconds.count
(count)
Count of durations, in seconds, that a received event was blocked before it could be queued
cilium.kvstore.events_queue.seconds.sum
(gauge)
Sum of durations, in seconds, that a received event was blocked before it could be queued
Shown as second
cilium.operator.eni.available
(gauge)
Number of ENI with addresses available
Shown as unit
cilium.operator.eni.available.ips_per_subnet
(gauge)
Number of available IPs per subnet ID
Shown as unit
cilium.operator.eni.aws_api_duration.seconds.count
(count)
Count of duration of interactions with AWS API
Shown as request
cilium.operator.eni.aws_api_duration.seconds.sum
(gauge)
Sum of duration of interactions with AWS API
Shown as second
cilium.operator.eni.deficit_resolver.duration.seconds.count
(count)
Count of duration of deficit resolver trigger runs
Shown as operation
cilium.operator.eni.deficit_resolver.duration.seconds.sum
(gauge)
Sum of duration of deficit resolver trigger runs
Shown as second
cilium.operator.eni.deficit_resolver.folds
(gauge)
Current level of deficit resolver folding
Shown as unit
cilium.operator.eni.deficit_resolver.latency.seconds.count
(count)
Count of latency between deficit resolver queue and trigger run
Shown as operation
cilium.operator.eni.deficit_resolver.latency.seconds.sum
(gauge)
Sum of latency between deficit resolver queue and trigger run
Shown as second
cilium.operator.eni.deficit_resolver.queued.total
(gauge)
Number of queued deficit resolver triggers
Shown as event
cilium.operator.eni.ec2_resync.duration.seconds.count
(count)
Count of duration of ec2 resync trigger runs
Shown as operation
cilium.operator.eni.ec2_resync.duration.seconds.sum
(gauge)
Sum of duration of ec2 resync trigger runs
Shown as second
cilium.operator.eni.ec2_resync.folds
(gauge)
Current level of ec2 resync folding
Shown as unit
cilium.operator.eni.ec2_resync.latency.seconds.count
(count)
Count of latency between ec2 resync queue and trigger run
Shown as operation
cilium.operator.eni.ec2_resync.latency.seconds.sum
(gauge)
Sum of latency between ec2 resync queue and trigger run
Shown as second
cilium.operator.eni.ec2_resync.queued.total
(gauge)
Number of queued ec2 resync triggers
Shown as unit
cilium.operator.eni.interface_creation_ops
(count)
Number of ENIs allocated
Shown as operation
cilium.operator.eni.ips.total
(gauge)
Number of IPs allocated
Shown as unit
cilium.operator.eni.k8s_sync.duration.seconds.count
(count)
Count of duration of k8s sync trigger run
Shown as operation
cilium.operator.eni.k8s_sync.duration.seconds.sum
(gauge)
Sum of duration of k8s sync trigger run
Shown as second
cilium.operator.eni.k8s_sync.folds
(gauge)
Current level of k8s sync folding
Shown as second
cilium.operator.eni.k8s_sync.latency.seconds.count
(count)
Count of duration of k8s sync latency between queue and trigger run
Shown as operation
cilium.operator.eni.k8s_sync.latency.seconds.sum
(gauge)
Sum of duration of k8s sync latency between queue and trigger run
Shown as second
cilium.operator.eni.k8s_sync.queued.total
(gauge)
Number of queued k8s sync triggers
Shown as unit
cilium.operator.eni.nodes.total
(gauge)
Number of nodes by category
Shown as node
cilium.operator.eni.resync.total
(count)
Number of resync operations to synchronize AWS EC2 metadata
Shown as unit

Service Checks

cilium.prometheus.health: Returns CRITICAL if the Agent cannot reach the metrics endpoints, OK otherwise.

Events

Cilium does not include any events.

Troubleshooting

Need help? Contact Datadog support.