Cilium

Agent Check

Supported OS: Linux, macOS, Windows

Overview

This check monitors Cilium through the Datadog Agent. The integration can collect metrics from either cilium-agent or cilium-operator.

Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates documentation for guidance on applying these instructions.

Installation

The Cilium check is included in the Datadog Agent package, but additional setup is required to expose Prometheus metrics.

  1. To enable Prometheus metrics in both cilium-agent and cilium-operator, deploy Cilium with the Helm value global.prometheus.enabled=true.

  2. Alternatively, enable Prometheus metrics separately:

    • In cilium-agent, add --prometheus-serve-addr=:9090 to the args section of the Cilium DaemonSet configuration:

      # [...]
      spec:
        containers:
          - args:
              - --prometheus-serve-addr=:9090
    • In cilium-operator, add --enable-metrics to the args section of the Cilium deployment configuration:

      # [...]
      spec:
        containers:
          - args:
              - --enable-metrics
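If you keep Helm settings in a values file rather than passing them on the command line, the Helm value from step 1 can be written as nested YAML. This is a minimal sketch: the file name values.yaml is an assumption, and the key path is taken as-is from the step above. Chart versions may name this key differently, so check the values reference for the chart you deploy.

```yaml
# values.yaml (hypothetical file name)
# Equivalent to passing --set global.prometheus.enabled=true to Helm.
global:
  prometheus:
    enabled: true
```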

Configuration

Host

  1. Edit the cilium.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your Cilium performance data. See the sample cilium.d/conf.yaml for all available configuration options.

    • To collect cilium-agent metrics, enable the agent_endpoint option.
    • To collect cilium-operator metrics, enable the operator_endpoint option.
  2. Restart the Agent.
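As an illustration of step 1, a cilium.d/conf.yaml might look like the sketch below. The agent URL assumes the Agent runs on the same host as cilium-agent and that metrics are served on port 9090, as set by --prometheus-serve-addr=:9090 in the Installation section. The operator URL, and its port in particular, is an assumption: use the address your cilium-operator actually exposes. The sketch also assumes one endpoint option per instance. The sample conf.yaml remains the authoritative list of options.

```yaml
# cilium.d/conf.yaml (illustrative sketch, not the shipped sample file)
init_config:

instances:
  # Collect cilium-agent metrics.
  # localhost:9090 is an assumption based on --prometheus-serve-addr=:9090.
  - agent_endpoint: http://localhost:9090/metrics

  # To collect cilium-operator metrics, enable operator_endpoint in its own instance.
  # The host and port here are assumptions; adjust them to your deployment.
  # - operator_endpoint: http://localhost:6942/metrics
```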

Log collection

Cilium produces two types of logs: cilium-agent and cilium-operator.

  1. Log collection is disabled by default in the Datadog Agent. Enable it in your DaemonSet configuration:

     # (...)
       env:
         # (...)
         - name: DD_LOGS_ENABLED
           value: "true"
         - name: DD_LOGS_CONFIG_CONTAINER_COLLECT_ALL
           value: "true"
     # (...)
  2. Mount the Docker socket into the Datadog Agent, as shown in the example manifest. If you are not using Docker, mount the /var/log/pods directory instead.

  3. Restart the Agent.
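For the non-Docker case in step 2, mounting /var/log/pods into the Agent container could look like the fragment below. It is a sketch of a Datadog Agent DaemonSet excerpt, not a complete manifest: the container name agent and the volume name logpodpath are assumptions, and only the /var/log/pods path comes from the step above.

```yaml
# Excerpt of a Datadog Agent DaemonSet (illustrative; names are assumptions)
spec:
  template:
    spec:
      containers:
        - name: agent                  # hypothetical container name
          volumeMounts:
            - name: logpodpath         # hypothetical volume name
              mountPath: /var/log/pods
              readOnly: true
      volumes:
        - name: logpodpath
          hostPath:
            path: /var/log/pods        # pod log directory on the node
```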

Containerized

For containerized environments, see the Autodiscovery Integration Templates documentation for guidance on applying the parameters below.

Metric collection

Parameter             Value
<INTEGRATION_NAME>    cilium
<INIT_CONFIG>         blank or {}
<INSTANCE_CONFIG>     {"agent_endpoint": "http://%%host%%:9090/metrics"}
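On Kubernetes, one common way to supply these three parameters is through Autodiscovery pod annotations. The fragment below is a sketch that assumes the Cilium container is named cilium-agent; replace that part of each annotation key with the container name used in your DaemonSet.

```yaml
# Pod template metadata for the Cilium DaemonSet (illustrative excerpt)
metadata:
  annotations:
    # "cilium-agent" in each key is the assumed container name.
    ad.datadoghq.com/cilium-agent.check_names: '["cilium"]'
    ad.datadoghq.com/cilium-agent.init_configs: '[{}]'
    ad.datadoghq.com/cilium-agent.instances: '[{"agent_endpoint": "http://%%host%%:9090/metrics"}]'
```

The %%host%% template variable is resolved by the Agent to the container's IP address at runtime.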

Log collection

Log collection is disabled by default in the Datadog Agent. To enable it, see the Docker log collection documentation.

Parameter       Value
<LOG_CONFIG>    {"source": "cilium-agent", "service": "cilium-agent"}
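With Kubernetes annotations, the log parameter can be passed the same way as a check configuration. As before, this sketch assumes a container named cilium-agent.

```yaml
# Pod template metadata for the Cilium DaemonSet (illustrative excerpt)
metadata:
  annotations:
    # "cilium-agent" in the key is the assumed container name.
    ad.datadoghq.com/cilium-agent.logs: '[{"source": "cilium-agent", "service": "cilium-agent"}]'
```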

Validation

Run the Agent's status subcommand and look for cilium under the Checks section.

Data Collected

Metrics

cilium.agent.api_process_time.seconds.count
(count)
Count of processing time for all API calls
Shown as request
cilium.agent.api_process_time.seconds.sum
(gauge)
Sum of processing time for all API calls
Shown as second
cilium.agent.bootstrap.seconds.count
(count)
Count of bootstrap durations
cilium.agent.bootstrap.seconds.sum
(gauge)
Sum of bootstrap durations
Shown as second
cilium.bpf.map_ops.total
(count)
Total BPF map operations performed
Shown as operation
cilium.controllers.failing.count
(count)
Number of failing controllers
Shown as error
cilium.controllers.runs_duration.seconds.count
(count)
Count of controller processes duration
Shown as operation
cilium.controllers.runs_duration.seconds.sum
(gauge)
Sum of controller processes duration
Shown as second
cilium.controllers.runs.total
(count)
Total number of controller runs
Shown as event
cilium.datapath.conntrack_gc.duration.seconds.count
(count)
Count of garbage collector process duration
Shown as operation
cilium.datapath.conntrack_gc.duration.seconds.sum
(gauge)
Sum of garbage collector process duration
Shown as second
cilium.datapath.conntrack_gc.entries
(gauge)
The number of alive and deleted conntrack entries
Shown as garbage collection
cilium.datapath.conntrack_gc.key_fallbacks.total
(count)
The total number of conntrack entries
Shown as garbage collection
cilium.datapath.conntrack_gc.runs.total
(count)
Total number of the conntrack garbage collector process runs
Shown as garbage collection
cilium.datapath.errors.total
(count)
Total number of errors in datapath management
Shown as error
cilium.drop_bytes.total
(count)
Total dropped bytes
Shown as byte
cilium.drop_count.total
(count)
Total dropped packets
Shown as packet
cilium.endpoint.count
(count)
Total ready endpoints managed by agent
Shown as unit
cilium.endpoint.regeneration_time_stats.seconds.count
(count)
Count of endpoint regeneration time stats
Shown as operation
cilium.endpoint.regeneration_time_stats.seconds.sum
(gauge)
Sum of endpoint regeneration time stats
Shown as second
cilium.endpoint.regenerations.count
(count)
Count of completed endpoint regenerations
Shown as unit
cilium.endpoint.state
(gauge)
Count of all endpoints
Shown as unit
cilium.errors_warning.total
(count)
Total error warnings
Shown as error
cilium.event_timestamp
(gauge)
Last timestamp of event received
Shown as time
cilium.forward_bytes.total
(count)
Total forwarded bytes
Shown as byte
cilium.forward_count.total
(count)
Total forwarded packets
Shown as packet
cilium.fqdn.gc_deletions.total
(count)
Total number of FQDNs cleaned in FQDN garbage collector job
Shown as event
cilium.identity.count
(gauge)
Number of identities allocated
Shown as unit
cilium.ip_addresses.count
(gauge)
Number of allocated ip_addresses
Shown as unit
cilium.ipam.events.total
(count)
Number of IPAM events received by action and datapath family type
Shown as event
cilium.k8s_client.api_calls.count
(count)
Number of API calls made to kube-apiserver
Shown as request
cilium.k8s_client.api_latency_time.seconds.count
(count)
Count of processed API call duration
Shown as request
cilium.k8s_client.api_latency_time.seconds.sum
(gauge)
Sum of processed API call duration
Shown as second
cilium.kubernetes.events_received.total
(count)
Number of Kubernetes received events processed
Shown as event
cilium.kubernetes.events.total
(count)
Number of Kubernetes events processed
Shown as event
cilium.nodes.all_datapath_validations.total
(count)
Number of validation calls to implement the datapath implementation of a node
Shown as unit
cilium.nodes.all_events_received.total
(count)
Number of node events received
Shown as event
cilium.nodes.managed.total
(gauge)
Number of nodes managed
Shown as node
cilium.policy.count
(gauge)
Number of policies currently loaded
Shown as unit
cilium.policy.endpoint_enforcement_status
(gauge)
Number of endpoints labeled by policy enforcement status
Shown as unit
cilium.policy.import_errors.count
(count)
Number of failed policy imports
Shown as error
cilium.policy.l7_denied.total
(count)
Number of total L7 denied requests/responses due to policy
Shown as unit
cilium.policy.l7_forwarded.total
(count)
Number of total L7 forwarded requests/responses
Shown as unit
cilium.policy.l7_parse_errors.total
(count)
Number of total L7 parse errors
Shown as error
cilium.policy.l7_received.total
(count)
Number of total L7 received requests/responses
Shown as unit
cilium.policy.max_revision
(gauge)
Highest policy revision number in the agent
Shown as unit
cilium.policy.regeneration_time_stats.seconds.count
(count)
Policy regeneration time stats count
Shown as operation
cilium.policy.regeneration_time_stats.seconds.sum
(gauge)
Policy regeneration time stats sum
Shown as second
cilium.policy.regeneration.total
(count)
Total number of successful policy regenerations
Shown as unit
cilium.process.cpu.seconds.total
(gauge)
Process CPU time in seconds
Shown as second
cilium.process.max_fds
(gauge)
Process file descriptor maximum
Shown as file
cilium.process.open_fds
(gauge)
Number of open file descriptors
Shown as file
cilium.process.resident_memory.bytes
(gauge)
Total resident memory bytes
Shown as byte
cilium.process.start_time.seconds
(gauge)
Process start time
Shown as second
cilium.process.virtual_memory.bytes
(gauge)
Virtual memory bytes
Shown as byte
cilium.process.virtual_memory.max.bytes
(gauge)
Maximum virtual memory bytes
Shown as byte
cilium.subprocess.start.total
(count)
Number of times that Cilium has started a subprocess
Shown as unit
cilium.triggers_policy.update_call_duration.seconds.count
(count)
Count of policy update trigger duration
Shown as operation
cilium.triggers_policy.update_call_duration.seconds.sum
(gauge)
Sum of policy update trigger duration
Shown as second
cilium.triggers_policy.update_folds
(gauge)
Number of folds
Shown as unit
cilium.triggers_policy.update.total
(count)
Total number of policy update trigger invocations
Shown as unit
cilium.unreachable.health_endpoints
(gauge)
Number of health endpoints that cannot be reached
Shown as unit
cilium.unreachable.nodes
(gauge)
Number of nodes that cannot be reached
Shown as node
cilium.operator.process.cpu.seconds
(count)
Total user and system CPU time spent in seconds
Shown as second
cilium.operator.process.max_fds
(gauge)
Maximum number of open file descriptors
Shown as file
cilium.operator.process.open_fds
(gauge)
Number of open file descriptors
Shown as file
cilium.operator.process.resident_memory.bytes
(gauge)
Resident memory size in bytes
Shown as byte
cilium.operator.process.start_time.second
(gauge)
Start time of the process since unix epoch in seconds
Shown as second
cilium.operator.process.virtual_memory.bytes
(gauge)
Virtual memory size in bytes
Shown as byte
cilium.operator.process.virtual_memory_max.bytes
(gauge)
Maximum amount of virtual memory available in bytes
Shown as byte
cilium.kvstore.operations_duration.seconds.count
(count)
Duration of kvstore operation count
Shown as operation
cilium.kvstore.operations_duration.seconds.sum
(gauge)
Duration of kvstore operation sum
Shown as second
cilium.kvstore.events_queue.seconds.count
(count)
Count of durations, in seconds, that a received event was blocked before it could be queued
cilium.kvstore.events_queue.seconds.sum
(gauge)
Sum of durations, in seconds, that a received event was blocked before it could be queued
Shown as second
cilium.operator.eni.available
(gauge)
Number of ENI with addresses available
Shown as unit
cilium.operator.eni.available.ips_per_subnet
(gauge)
Number of available IPs per subnet ID
Shown as unit
cilium.operator.eni.aws_api_duration.seconds.count
(count)
Count of duration of interactions with AWS API
Shown as request
cilium.operator.eni.aws_api_duration.seconds.sum
(gauge)
Sum of duration of interactions with AWS API
Shown as second
cilium.operator.eni.deficit_resolver.duration.seconds.count
(count)
Count of duration of deficit resolver trigger runs
Shown as operation
cilium.operator.eni.deficit_resolver.duration.seconds.sum
(gauge)
Sum of duration of deficit resolver trigger runs
Shown as second
cilium.operator.eni.deficit_resolver.folds
(gauge)
Current level of deficit resolver folding
Shown as unit
cilium.operator.eni.deficit_resolver.latency.seconds.count
(count)
Count of latency between deficit resolver queue and trigger run
Shown as operation
cilium.operator.eni.deficit_resolver.latency.seconds.sum
(gauge)
Sum of latency between deficit resolver queue and trigger run
Shown as second
cilium.operator.eni.deficit_resolver.queued.total
(gauge)
Number of queued deficit resolver triggers
Shown as event
cilium.operator.eni.ec2_resync.duration.seconds.count
(count)
Count of duration of ec2 resync trigger runs
Shown as operation
cilium.operator.eni.ec2_resync.duration.seconds.sum
(gauge)
Sum of duration of ec2 resync trigger runs
Shown as second
cilium.operator.eni.ec2_resync.folds
(gauge)
Current level of ec2 resync folding
Shown as unit
cilium.operator.eni.ec2_resync.latency.seconds.count
(count)
Count of latency between ec2 resync queue and trigger run
Shown as operation
cilium.operator.eni.ec2_resync.latency.seconds.sum
(gauge)
Sum of latency between ec2 resync queue and trigger run
Shown as second
cilium.operator.eni.ec2_resync.queued.total
(gauge)
Number of queued ec2 resync triggers
Shown as unit
cilium.operator.eni.interface_creation_ops
(count)
Number of ENIs allocated
Shown as operation
cilium.operator.eni.ips.total
(gauge)
Number of IPs allocated
Shown as unit
cilium.operator.eni.k8s_sync.duration.seconds.count
(count)
Count of duration of k8s sync trigger run
Shown as operation
cilium.operator.eni.k8s_sync.duration.seconds.sum
(gauge)
Sum of duration of k8s sync trigger run
Shown as second
cilium.operator.eni.k8s_sync.folds
(gauge)
Current level of k8s sync folding
Shown as second
cilium.operator.eni.k8s_sync.latency.seconds.count
(count)
Count of duration of k8s sync latency between queue and trigger run
Shown as operation
cilium.operator.eni.k8s_sync.latency.seconds.sum
(gauge)
Sum of duration of k8s sync latency between queue and trigger run
Shown as second
cilium.operator.eni.k8s_sync.queued.total
(gauge)
Number of queued k8s sync triggers
Shown as unit
cilium.operator.eni.nodes.total
(gauge)
Number of nodes by category
Shown as node
cilium.operator.eni.resync.total
(count)
Number of resync operations to synchronize AWS EC2 metadata
Shown as unit

Service Checks

cilium.prometheus.health: Returns CRITICAL if the Agent cannot reach the metrics endpoints. Otherwise, returns OK.

Events

Cilium does not include any events.

Troubleshooting

Need help? Contact Datadog support.