Google Kubernetes Engine

Overview

Google Kubernetes Engine (GKE) is a powerful cluster manager and orchestration system for running Docker containers.

Get metrics from Google Kubernetes Engine to:

  • Visualize the performance of your GKE containers and the GKE control plane.
  • Correlate the performance of your GKE containers with your applications.

This integration comes with two separate preset dashboards:

  • The standard GKE dashboard shows the GKE and GKE control plane metrics collected from the Google integration.
  • The enhanced GKE dashboard shows metrics from Datadog's Agent-based Kubernetes integration alongside the GKE control plane metrics collected from the Google integration.

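The standard dashboard provides observability into GKE with only simple configuration. The enhanced dashboard requires additional configuration steps, but it provides more real-time Kubernetes metrics and is often a better starting point when you clone and customize a dashboard to monitor production workloads.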

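Unlike in self-hosted Kubernetes clusters, the GKE control plane is managed by Google and is not accessible to a Datadog Agent running in the cluster. Observability into the GKE control plane therefore requires the Google integration, even if you primarily use the Datadog Agent to monitor your clusters.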

Setup

Metric collection

Installation

If you haven't already, set up the Google Cloud Platform integration first. The standard metrics and preset dashboard require no other installation steps.

To populate the enhanced dashboard and enable APM tracing, logging, profiling, security, and other Datadog services, install the Datadog Agent in your GKE cluster.

Log collection

Google Kubernetes Engine logs are collected with Google Cloud Logging and sent to Cloud Pub/Sub with an HTTP push forwarder. If you haven't already, set up Cloud Pub/Sub with an HTTP push forwarder.

Once this is done, export your Google Kubernetes Engine logs from Google Cloud Logging to the Pub/Sub:

  1. Go to the GCP Logs Explorer page and filter for Kubernetes and GKE logs.

  2. Create a sink and name it accordingly.

  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub that was created for the export. Note: This Pub/Sub can be located in a different project.

  4. Click Create and wait for the confirmation message to appear.
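
The filter used in step 1 can be sketched as a small helper that builds a Logs Explorer query string. This is a minimal illustration, not part of the integration: the function name build_gke_log_filter is hypothetical, and the resource types k8s_cluster, k8s_node, k8s_pod, and k8s_container are assumed here as the monitored resource types under which GKE logs commonly appear. Adjust the list to match the logs you actually want to export.

```python
from typing import Sequence

# Assumption: these monitored resource types cover the GKE logs of interest.
DEFAULT_GKE_RESOURCE_TYPES = ("k8s_cluster", "k8s_node", "k8s_pod", "k8s_container")


def build_gke_log_filter(
    resource_types: Sequence[str] = DEFAULT_GKE_RESOURCE_TYPES,
) -> str:
    """Return a Logs Explorer filter that matches any of the given resource types."""
    if not resource_types:
        raise ValueError("at least one resource type is required")
    # Quote each type and join with OR inside a single resource.type comparison.
    quoted = " OR ".join(f'"{name}"' for name in resource_types)
    return f"resource.type=({quoted})"


if __name__ == "__main__":
    # Paste the printed string into the Logs Explorer query box before creating the sink.
    print(build_gke_log_filter())
```

Building the string in one place keeps the query you test in the Logs Explorer identical to the filter you later give the sink.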

Data Collected

Metrics

gcp.gke.container.accelerator.duty_cycle
(gauge)
Percent of time over the past sample period during which the accelerator was actively processing.
Shown as percent
gcp.gke.container.accelerator.memory_total
(gauge)
Total accelerator memory.
Shown as byte
gcp.gke.container.accelerator.memory_used
(gauge)
Total accelerator memory allocated.
Shown as byte
gcp.gke.container.accelerator.request
(gauge)
Number of accelerator devices requested by the container.
Shown as device
gcp.gke.container.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the container.
Shown as second
gcp.gke.container.cpu.limit_cores
(gauge)
CPU cores limit of the container.
Shown as core
gcp.gke.container.cpu.limit_utilization
(gauge)
Fraction of the CPU limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.cpu.request_cores
(gauge)
Number of CPU cores requested by the container.
Shown as core
gcp.gke.container.cpu.request_utilization
(gauge)
Fraction of the requested CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.container.ephemeral_storage.limit_bytes
(gauge)
Local ephemeral storage limit.
Shown as byte
gcp.gke.container.ephemeral_storage.request_bytes
(gauge)
Local ephemeral storage request.
Shown as byte
gcp.gke.container.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage usage.
Shown as byte
gcp.gke.container.memory.limit_bytes
(gauge)
Memory limit of the container.
Shown as byte
gcp.gke.container.memory.limit_utlization
(gauge)
Fraction of the memory limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.page_fault_count
(count)
Number of page faults, broken down by type.
Shown as fault
gcp.gke.container.memory.request_bytes
(gauge)
Memory request of the container.
Shown as byte
gcp.gke.container.memory.request_utilization
(gauge)
Fraction of the requested memory that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.used_bytes
(gauge)
Memory usage of the container.
Shown as byte
gcp.gke.container.restart_count
(count)
Number of times the container has restarted.
Shown as occurrence
gcp.gke.container.uptime
(gauge)
Time in seconds that the container has been running.
Shown as second
gcp.gke.node.cpu.allocatable_cores
(gauge)
Number of allocatable CPU cores on the node.
Shown as core
gcp.gke.node.cpu.allocatable_utilization
(gauge)
Fraction of the allocatable CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.node.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used on the node.
Shown as second
gcp.gke.node.cpu.total_cores
(gauge)
Total number of CPU cores on the node.
Shown as core
gcp.gke.node.ephemeral_storage.allocatable_bytes
(gauge)
Local ephemeral storage bytes allocatable on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.inodes_free
(gauge)
Free number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.inodes_total
(gauge)
Total number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.total_bytes
(gauge)
Total ephemeral storage bytes on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage bytes used by the node.
Shown as byte
gcp.gke.node.memory.allocatable_bytes
(gauge)
Memory bytes allocatable on the node.
Shown as byte
gcp.gke.node.memory.allocatable_utilization
(gauge)
Fraction of the allocatable memory that is currently in use on the instance.
Shown as fraction
gcp.gke.node.memory.total_bytes
(gauge)
Total number of bytes of memory on the node.
Shown as byte
gcp.gke.node.memory.used_bytes
(gauge)
Memory bytes used by the node.
Shown as byte
gcp.gke.node.network.received_bytes_count
(count)
Cumulative number of bytes received by the node over the network.
Shown as byte
gcp.gke.node.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the node over the network.
Shown as byte
gcp.gke.node.pid_limit
(gauge)
Max PID of OS on the node.
gcp.gke.node.pid_used
(gauge)
Number of running processes in the OS on the node.
gcp.gke.node_daemon.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the node level system daemon.
Shown as second
gcp.gke.node_daemon.memory.used_bytes
(gauge)
Memory usage by the system daemon.
Shown as byte
gcp.gke.pod.network.received_bytes_count
(count)
Cumulative number of bytes received by the pod over the network.
Shown as byte
gcp.gke.pod.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the pod over the network.
Shown as byte
gcp.gke.pod.volume.total_bytes
(gauge)
Total number of disk bytes available to the pod.
Shown as byte
gcp.gke.pod.volume.used_bytes
(gauge)
Number of disk bytes used by the pod.
Shown as byte
gcp.gke.pod.volume.utilization
(gauge)
Fraction of the volume that is currently being used by the instance.
Shown as fraction
gcp.gke.control_plane.apiserver.admission_controller_admission_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.admission_step_admission_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.admission_webhook_admission_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.aggregator_unavailable_apiservice
(gauge)
gcp.gke.control_plane.apiserver.audit_event_total
(gauge)
Accumulated number of audit events generated and sent to the audit backend.
Shown as event
gcp.gke.control_plane.apiserver.audit_level_total
(gauge)
gcp.gke.control_plane.apiserver.audit_requests_rejected_total
(gauge)

Shown as request
gcp.gke.control_plane.apiserver.client_certificate_expiration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.current_inflight_requests
(gauge)
Maximum number of in-flight requests currently in use by this apiserver, per request kind.
Shown as request
gcp.gke.control_plane.apiserver.etcd_object_counts
(gauge)
Number of stored objects split by kind.
Shown as object
gcp.gke.control_plane.apiserver.etcd_request_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.init_events_total
(gauge)

Shown as event
gcp.gke.control_plane.apiserver.longrunning_gauge
(gauge)
Gauge of all active long-running apiserver requests.
Shown as request
gcp.gke.control_plane.apiserver.registered_watchers
(gauge)
Number of currently registered watchers for a given resource.
Shown as object
gcp.gke.control_plane.apiserver.request_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.request_total
(gauge)
Accumulated number of apiserver requests.
Shown as request
gcp.gke.control_plane.apiserver.response_sizes
(gauge)
gcp.gke.control_plane.apiserver.workqueue_adds_total
(count)
gcp.gke.control_plane.apiserver.workqueue_depth
(gauge)
gcp.gke.control_plane.apiserver.workqueue_longest_running_processor_seconds
(gauge)
Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.apiserver.workqueue_queue_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.workqueue_retries_total
(count)
gcp.gke.control_plane.apiserver.workqueue_unfinished_work_seconds
(gauge)

Shown as second
gcp.gke.control_plane.apiserver.workqueue_work_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.controller_manager.cloudprovider_gce_api_request_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.controller_manager.cronjob_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by cronjob controller
gcp.gke.control_plane.controller_manager.daemon_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by daemon controller
gcp.gke.control_plane.controller_manager.deployment_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by deployment controller
gcp.gke.control_plane.controller_manager.endpoint_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by endpoint controller
gcp.gke.control_plane.controller_manager.gc_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by GC controller
gcp.gke.control_plane.controller_manager.job_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by job controller
gcp.gke.control_plane.controller_manager.leader_election_master_status
(gauge)
gcp.gke.control_plane.controller_manager.namespace_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by namespace controller
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
Count of node eviction events.
gcp.gke.control_plane.controller_manager.node_collector_unhealthy_nodes_in_zone
(gauge)
Number of unhealthy nodes
gcp.gke.control_plane.controller_manager.node_collector_zone_health
(gauge)
gcp.gke.control_plane.controller_manager.node_collector_zone_size
(gauge)
gcp.gke.control_plane.controller_manager.node_ipam_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by IPAM controller
gcp.gke.control_plane.controller_manager.node_lifecycle_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by lifecycle controller
gcp.gke.control_plane.controller_manager.persistentvolume_protection_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by persistent volume protection controller
gcp.gke.control_plane.controller_manager.persistentvolumeclaim_protection_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by persistent volume claim protection controller
gcp.gke.control_plane.controller_manager.replicaset_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by ReplicaSet controller
gcp.gke.control_plane.controller_manager.replication_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by replication controller
gcp.gke.control_plane.controller_manager.route_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by route controller
gcp.gke.control_plane.controller_manager.service_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by service controller
gcp.gke.control_plane.controller_manager.serviceaccount_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by service account controller
gcp.gke.control_plane.controller_manager.serviceaccount_tokens_controller_rate_limiter_use
(gauge)
Usage of the rate limiter by service account tokens controller
gcp.gke.control_plane.controller_manager.workqueue_adds_total
(count)
gcp.gke.control_plane.controller_manager.workqueue_depth
(gauge)
gcp.gke.control_plane.controller_manager.workqueue_longest_running_processor_seconds
(gauge)
Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_queue_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.controller_manager.workqueue_retries_total
(count)
gcp.gke.control_plane.controller_manager.workqueue_unfinished_work_seconds
(gauge)

Shown as second
gcp.gke.control_plane.controller_manager.workqueue_work_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.scheduler.binding_duration_seconds
(gauge)
Binding latency in seconds.
Shown as second
gcp.gke.control_plane.scheduler.e2e_scheduling_duration_seconds
(gauge)
Total e2e scheduling latency.
Shown as second
gcp.gke.control_plane.scheduler.framework_extension_point_duration_seconds
(gauge)

Shown as second
gcp.gke.control_plane.scheduler.leader_election_master_status
(gauge)
gcp.gke.control_plane.scheduler.pending_pods
(gauge)
gcp.gke.control_plane.scheduler.preemption_attempts_total
(count)
Number of preemption attempts in the cluster.
gcp.gke.control_plane.scheduler.preemption_victims
(gauge)
Number of selected pods during the latest preemption round.
gcp.gke.control_plane.scheduler.schedule_attempts_total
(gauge)
Number of attempts to schedule pods.
gcp.gke.control_plane.scheduler.scheduling_algorithm_duration_seconds
(gauge)
Total scheduling algorithm latency.
Shown as second
gcp.gke.control_plane.scheduler.scheduling_algorithm_preemption_evaluation_seconds
(gauge)

Shown as second
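
Several metrics above are shown as fractions (for example gcp.gke.container.cpu.limit_utilization), and several are cumulative counts (for example gcp.gke.container.cpu.core_usage_time). The sketch below illustrates how those two kinds of values can be interpreted. It assumes you hold raw cumulative samples as (timestamp in seconds, value) pairs, which may not match what a Datadog query returns, since values can already be normalized per interval. The function names are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple


def fraction_to_percent(fraction: float) -> float:
    """Convert a value shown as a fraction (0.0 to 1.0) into a percentage."""
    return fraction * 100.0


def average_cores(samples: List[Tuple[float, float]]) -> List[float]:
    """Derive average cores in use from cumulative CPU-seconds samples.

    samples: (timestamp_seconds, cumulative_cpu_seconds) pairs ordered by time.
    Returns one value per consecutive pair of samples.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        # Skip out-of-order samples and counter resets (for example after a restart).
        if elapsed <= 0 or v1 < v0:
            continue
        rates.append((v1 - v0) / elapsed)
    return rates


if __name__ == "__main__":
    print(fraction_to_percent(0.25))
    print(average_cores([(0.0, 0.0), (60.0, 30.0), (120.0, 90.0)]))
```

Dividing the increase in CPU-seconds by the elapsed seconds gives the average number of cores in use over that interval, which is easier to compare against gcp.gke.container.cpu.limit_cores than the raw cumulative value.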

Events

The Google Kubernetes Engine integration does not include any events.

Service Checks

The Google Kubernetes Engine integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.