Google Kubernetes Engine, Cloud

Overview

Google Kubernetes Engine (GKE) is a powerful cluster manager and orchestration system for running Docker containers.

Get metrics from Google Kubernetes Engine to:

  • Visualize the performance of your GKE containers and the GKE control plane.
  • Correlate the performance of your GKE containers with your applications.

This integration comes with two separate preset dashboards:

  • The standard GKE dashboard shows GKE and GKE control plane metrics collected from the Google integration.
  • The enhanced GKE dashboard shows metrics from Datadog's Agent-based Kubernetes integration alongside GKE control plane metrics collected from the Google integration.

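The standard dashboard provides observability into GKE with only a simple configuration. The enhanced dashboard requires additional configuration steps, but it provides more real-time Kubernetes metrics and is often a better starting point when you clone and customize a dashboard for monitoring workloads in production.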

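Unlike self-hosted Kubernetes clusters, the GKE control plane is managed by Google and is not accessible to a Datadog Agent running in the cluster. Observability into the GKE control plane therefore requires the Google integration, even if you primarily use the Datadog Agent to monitor your clusters.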

Setup

Metric collection

Installation

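  1. If you haven't already, set up the Google Cloud Platform integration first. There are no other installation steps for the standard metrics and preset dashboard.

  2. To populate the enhanced dashboard and enable APM tracing, logging, profiling, security, and other Datadog services, install the Datadog Agent in your GKE cluster.

  3. To populate the control plane metrics, you must enable GKE control plane metrics. Control plane metrics give you visibility into the operation of the Kubernetes control plane, which Google manages in GKE.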
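
As a rough illustration of step 3, the sketch below assembles a `gcloud container clusters update` command that turns on control plane metrics. The helper name, cluster name, and location are hypothetical placeholders, and the `--monitoring` component list is an assumption based on Google's GKE documentation; verify it against the current GKE docs before running anything.

```python
import shlex


def control_plane_metrics_command(cluster, location):
    """Build (but do not run) a gcloud command enabling GKE control plane metrics.

    `cluster` and `location` are placeholders supplied by the caller.
    """
    # Assumed component list; SYSTEM keeps the default system metrics enabled.
    components = ["SYSTEM", "API_SERVER", "SCHEDULER", "CONTROLLER_MANAGER"]
    return [
        "gcloud", "container", "clusters", "update", cluster,
        f"--location={location}",
        "--monitoring=" + ",".join(components),
    ]


if __name__ == "__main__":
    argv = control_plane_metrics_command("my-cluster", "us-central1")
    # Print a shell-safe rendering to review before running it by hand.
    print(shlex.join(argv))
```

The function only returns the argument list, so the command can be reviewed, or handed to `subprocess.run`, once the values have been checked.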

Log collection

Google Kubernetes Engine logs are collected with Google Cloud Logging and sent to a Dataflow job through a Cloud Pub/Sub topic. If you haven't already, set up logging with the Datadog Dataflow template.

Once this is done, export your Google Kubernetes Engine logs from Google Cloud Logging to the Pub/Sub topic:

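  1. Go to the GCP Logs Explorer page and filter for Kubernetes and GKE logs.

  2. Create a sink and name it accordingly.

  3. Choose "Cloud Pub/Sub" as the destination and select the Pub/Sub topic that was created for that purpose. Note: The Pub/Sub topic can be located in a different project.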

  4. Click Create and wait for the confirmation message to appear.
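
The filter and destination used in these steps can be sketched as plain strings. The resource types below are an assumption about which monitored resources count as "Kubernetes and GKE logs", and the project and topic names are hypothetical; the helpers only build the values you would paste into the Logs Explorer or pass to a sink definition.

```python
# Resource types assumed to cover Kubernetes and GKE logs; adjust as needed.
GKE_RESOURCE_TYPES = (
    "k8s_cluster",
    "k8s_container",
    "k8s_node",
    "k8s_pod",
    "gke_cluster",
)


def gke_log_filter(resource_types=GKE_RESOURCE_TYPES):
    """Return a Cloud Logging filter matching any of the given resource types."""
    clauses = [f'resource.type="{rt}"' for rt in resource_types]
    return " OR ".join(clauses)


def pubsub_sink_destination(project_id, topic_id):
    """Return the sink destination string for a Pub/Sub topic.

    The topic's project may differ from the project that owns the sink.
    """
    return f"pubsub.googleapis.com/projects/{project_id}/topics/{topic_id}"


if __name__ == "__main__":
    print(gke_log_filter())
    print(pubsub_sink_destination("my-logging-project", "export-logs-to-datadog"))
```

Keeping the topic's project as a separate argument mirrors the note in step 3: the destination string names the topic's own project, which does not have to match the project the logs come from.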

Data Collected

Metrics

gcp.gke.container.accelerator.duty_cycle
(gauge)
Percent of time over the past sample period during which the accelerator was actively processing.
Shown as percent
gcp.gke.container.accelerator.memory_total
(gauge)
Total accelerator memory.
Shown as byte
gcp.gke.container.accelerator.memory_used
(gauge)
Total accelerator memory allocated.
Shown as byte
gcp.gke.container.accelerator.request
(gauge)
Number of accelerator devices requested by the container.
Shown as device
gcp.gke.container.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the container.
Shown as second
gcp.gke.container.cpu.limit_cores
(gauge)
CPU cores limit of the container.
Shown as core
gcp.gke.container.cpu.limit_utilization
(gauge)
Fraction of the CPU limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.cpu.request_cores
(gauge)
Number of CPU cores requested by the container.
Shown as core
gcp.gke.container.cpu.request_utilization
(gauge)
Fraction of the requested CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.container.ephemeral_storage.limit_bytes
(gauge)
Local ephemeral storage limit.
Shown as byte
gcp.gke.container.ephemeral_storage.request_bytes
(gauge)
Local ephemeral storage request.
Shown as byte
gcp.gke.container.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage usage.
Shown as byte
gcp.gke.container.memory.limit_bytes
(gauge)
Memory limit of the container.
Shown as byte
gcp.gke.container.memory.limit_utilization
(gauge)
Fraction of the memory limit that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.page_fault_count
(count)
Number of page faults, broken down by type.
Shown as fault
gcp.gke.container.memory.request_bytes
(gauge)
Memory request of the container.
Shown as byte
gcp.gke.container.memory.request_utilization
(gauge)
Fraction of the requested memory that is currently in use on the instance.
Shown as fraction
gcp.gke.container.memory.used_bytes
(gauge)
Memory usage of the container.
Shown as byte
gcp.gke.container.restart_count
(count)
Number of times the container has restarted.
Shown as occurrence
gcp.gke.container.uptime
(gauge)
Time in seconds that the container has been running.
Shown as second
gcp.gke.node.cpu.allocatable_cores
(gauge)
Number of allocatable CPU cores on the node.
Shown as core
gcp.gke.node.cpu.allocatable_utilization
(gauge)
Fraction of the allocatable CPU that is currently in use on the instance.
Shown as fraction
gcp.gke.node.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used on the node.
Shown as second
gcp.gke.node.cpu.total_cores
(gauge)
Total number of CPU cores on the node.
Shown as core
gcp.gke.node.ephemeral_storage.allocatable_bytes
(gauge)
Local ephemeral storage bytes allocatable on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.inodes_free
(gauge)
Free number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.inodes_total
(gauge)
Total number of inodes on local ephemeral storage.
gcp.gke.node.ephemeral_storage.total_bytes
(gauge)
Total ephemeral storage bytes on the node.
Shown as byte
gcp.gke.node.ephemeral_storage.used_bytes
(gauge)
Local ephemeral storage bytes used by the node.
Shown as byte
gcp.gke.node.memory.allocatable_bytes
(gauge)
Number of bytes of memory allocatable on the node.
Shown as byte
gcp.gke.node.memory.allocatable_utilization
(gauge)
Fraction of the allocatable memory that is currently in use on the instance.
Shown as fraction
gcp.gke.node.memory.total_bytes
(gauge)
Total number of bytes of memory on the node.
Shown as byte
gcp.gke.node.memory.used_bytes
(gauge)
Cumulative memory bytes used by the node.
Shown as byte
gcp.gke.node.network.received_bytes_count
(count)
Cumulative number of bytes received by the node over the network.
Shown as byte
gcp.gke.node.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the node over the network.
Shown as byte
gcp.gke.node.pid_limit
(gauge)
Max PID of OS on the node.
gcp.gke.node.pid_used
(gauge)
Number of running processes in the OS on the node.
gcp.gke.node_daemon.cpu.core_usage_time
(count)
Cumulative CPU usage on all cores used by the node level system daemon.
Shown as second
gcp.gke.node_daemon.memory.used_bytes
(gauge)
Memory usage by the system daemon.
Shown as byte
gcp.gke.pod.network.received_bytes_count
(count)
Cumulative number of bytes received by the pod over the network.
Shown as byte
gcp.gke.pod.network.sent_bytes_count
(count)
Cumulative number of bytes transmitted by the pod over the network.
Shown as byte
gcp.gke.pod.volume.total_bytes
(gauge)
Total number of disk bytes available to the pod.
Shown as byte
gcp.gke.pod.volume.used_bytes
(gauge)
Number of disk bytes used by the pod.
Shown as byte
gcp.gke.pod.volume.utilization
(gauge)
Fraction of the volume that is currently being used by the instance.
Shown as fraction
gcp.gke.control_plane.apiserver.admission_controller_admission_duration_seconds
(gauge)
Admission controller latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_step_admission_duration_seconds
(gauge)
Admission sub-step latency histogram in seconds, broken out for each operation and API resource and step type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.admission_webhook_admission_duration_seconds
(gauge)
Admission webhook latency histogram in seconds, identified by name and broken out for each operation and API resource and type (validate or admit).
Shown as second
gcp.gke.control_plane.apiserver.current_inflight_requests
(gauge)
Maximum number of in-flight requests currently in use by this apiserver, per request kind.
Shown as request
gcp.gke.control_plane.apiserver.request_duration_seconds
(gauge)
Response latency distribution in seconds for each verb, dry run value, group, version, resource, subresource, scope and component.
Shown as second
gcp.gke.control_plane.apiserver.request_total
(gauge)
Counter of apiserver requests broken out for each verb, dry run value, group, version, resource, scope, component, and HTTP response code.
Shown as request
gcp.gke.control_plane.apiserver.response_sizes
(gauge)
Response size distribution in bytes for each group, version, verb, resource, subresource, scope and component.
Shown as byte
gcp.gke.control_plane.apiserver.storage_objects
(gauge)
Number of stored objects at the time of last check split by kind.
Shown as object
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
Number of Node evictions that happened since current instance of NodeController started.
Shown as event
gcp.gke.control_plane.scheduler.pending_pods
(gauge)
Number of pending pods, by the queue type.
Shown as event
gcp.gke.control_plane.scheduler.pod_scheduling_duration_seconds
(gauge)
End-to-end latency for a pod being scheduled.
Shown as second
gcp.gke.control_plane.scheduler.preemption_attempts_total
(count)
Total preemption attempts in the cluster so far.
Shown as attempt
gcp.gke.control_plane.scheduler.preemption_victims
(gauge)
Number of selected preemption victims
Shown as event
gcp.gke.control_plane.scheduler.scheduling_attempt_duration_seconds
(gauge)
Scheduling attempt latency in seconds
Shown as second
gcp.gke.control_plane.scheduler.schedule_attempts_total
(gauge)
Number of attempts to schedule pods.
Shown as attempt
gcp.gke.control_plane.apiserver.aggregator_unavailable_apiservice
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_event_total
(gauge)
(Deprecated) Accumulated number of audit events generated and sent to the audit backend.
Shown as event
gcp.gke.control_plane.apiserver.audit_level_total
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.audit_requests_rejected_total
(gauge)
(Deprecated)
Shown as request
gcp.gke.control_plane.apiserver.client_certificate_expiration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.etcd_object_counts
(gauge)
(Deprecated) Number of stored objects split by kind.
Shown as object
gcp.gke.control_plane.apiserver.etcd_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.init_events_total
(gauge)
(Deprecated)
Shown as event
gcp.gke.control_plane.apiserver.longrunning_gauge
(gauge)
(Deprecated) Gauge of all active long-running apiserver requests.
Shown as request
gcp.gke.control_plane.apiserver.registered_watchers
(gauge)
(Deprecated) Number of currently registered watchers for a given resource.
Shown as object
gcp.gke.control_plane.apiserver.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.apiserver.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.apiserver.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.apiserver.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cloudprovider_gce_api_request_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.cronjob_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by cronjob controller
gcp.gke.control_plane.controller_manager.daemon_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by daemon controller
gcp.gke.control_plane.controller_manager.deployment_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by deployment controller
gcp.gke.control_plane.controller_manager.endpoint_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by endpoint controller
gcp.gke.control_plane.controller_manager.gc_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by GC controller
gcp.gke.control_plane.controller_manager.job_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by job controller
gcp.gke.control_plane.controller_manager.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.namespace_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by namespace controller
gcp.gke.control_plane.controller_manager.node_collector_evictions_number
(count)
(Deprecated) Count of node eviction events.
gcp.gke.control_plane.controller_manager.node_collector_unhealthy_nodes_in_zone
(gauge)
(Deprecated) Number of unhealthy nodes
gcp.gke.control_plane.controller_manager.node_collector_zone_health
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_collector_zone_size
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.node_ipam_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by IPAM controller
gcp.gke.control_plane.controller_manager.node_lifecycle_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by lifecycle controller
gcp.gke.control_plane.controller_manager.persistentvolume_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume protection controller
gcp.gke.control_plane.controller_manager.persistentvolumeclaim_protection_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by persistent volume claim protection controller
gcp.gke.control_plane.controller_manager.replicaset_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by ReplicaSet controller
gcp.gke.control_plane.controller_manager.replication_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by replication controller
gcp.gke.control_plane.controller_manager.route_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by route controller
gcp.gke.control_plane.controller_manager.service_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service controller
gcp.gke.control_plane.controller_manager.serviceaccount_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account controller
gcp.gke.control_plane.controller_manager.serviceaccount_tokens_controller_rate_limiter_use
(gauge)
(Deprecated) Usage of the rate limiter by service account tokens controller
gcp.gke.control_plane.controller_manager.workqueue_adds_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_depth
(gauge)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_longest_running_processor_seconds
(gauge)
(Deprecated) Number of seconds that the longest running processor has been running.
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_queue_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_retries_total
(count)
(Deprecated)
gcp.gke.control_plane.controller_manager.workqueue_unfinished_work_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.controller_manager.workqueue_work_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.binding_duration_seconds
(gauge)
(Deprecated) Binding latency in seconds.
Shown as second
gcp.gke.control_plane.scheduler.e2e_scheduling_duration_seconds
(gauge)
(Deprecated) Total e2e scheduling latency.
Shown as second
gcp.gke.control_plane.scheduler.framework_extension_point_duration_seconds
(gauge)
(Deprecated)
Shown as second
gcp.gke.control_plane.scheduler.leader_election_master_status
(gauge)
(Deprecated)
gcp.gke.control_plane.scheduler.scheduling_algorithm_duration_seconds
(gauge)
(Deprecated) Total scheduling algorithm latency.
Shown as second
gcp.gke.control_plane.scheduler.scheduling_algorithm_preemption_evaluation_seconds
(gauge)
(Deprecated)
Shown as second
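
Several metrics in this list are related by simple arithmetic: the `*_utilization` gauges are fractions of a limit, request, or allocatable amount, and the `core_usage_time` counts are cumulative CPU-seconds. The sketch below shows that arithmetic with made-up sample values. It assumes you are working with raw cumulative samples as the descriptions state; if the series you query is already a per-interval delta, divide that delta by the interval length instead. The function names are illustrative only.

```python
def average_cores_used(usage_start_s, usage_end_s, window_s):
    """Average cores in use over a window.

    Takes two cumulative core_usage_time samples (CPU-seconds) and the
    wall-clock seconds between them.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return (usage_end_s - usage_start_s) / window_s


def utilization(used, limit):
    """Fraction of a limit in use, or None when no limit is set."""
    if not limit:
        return None
    return used / limit


if __name__ == "__main__":
    # Hypothetical samples: 120 -> 150 CPU-seconds over a 60 second window.
    cores = average_cores_used(120.0, 150.0, 60.0)
    print(cores)                    # average cores used in the window
    print(utilization(cores, 2.0))  # fraction of a 2-core limit
```

With these sample values the container averaged half a core, which is a quarter of a 2-core limit. Returning `None` for a missing limit avoids a division by zero for containers that set no limit.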

Events

The Google Kubernetes Engine integration does not include any events.

Service Checks

The Google Kubernetes Engine integration does not include any service checks.

Troubleshooting

Need help? Contact Datadog support.