Monitoring Oracle Cloud Infrastructure (OCI) GPU instances is essential for ensuring optimal performance and reliability of your high-performance computing workloads. This integration provides a comprehensive set of GPU metrics through the gpu_infrastructure_health namespace, enabling you to track various aspects of GPU health and utilization.
This integration lets you monitor and alert on the health, capacity, throughput, status, and performance of your GPU instances.
It collects metrics and tags from the gpu_infrastructure_health namespace.
After setting up the Oracle Cloud Infrastructure integration, ensure that the gpu_infrastructure_health namespace is included in your Connector Hub.
oci.gpu_infrastructure_health.gpu_ecc_double_bit_errors (count) | The number of GPU double-bit ECC errors reported. Shown as error.
oci.gpu_infrastructure_health.gpu_ecc_single_bit_errors (count) | The number of GPU single-bit ECC errors reported. Shown as error.
oci.gpu_infrastructure_health.gpu_memory_utilization (gauge) | The percentage of GPU memory in use. Shown as percent.
oci.gpu_infrastructure_health.gpu_power_draw (gauge) | The amount of power the GPU draws.
oci.gpu_infrastructure_health.gpu_temperature (gauge) | The reported GPU temperature.
oci.gpu_infrastructure_health.gpu_utilization (gauge) | The GPU activity level, expressed as a percentage of total time. For instance pools, the value is averaged across all instances in the pool. Shown as percent.
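To alert on the metrics listed above, you can write metric monitor queries against them. The following is a minimal sketch that assembles query strings in the general Datadog monitor query form. The helper name `threshold_query`, the thresholds, the evaluation windows, and the `*` scope are illustrative assumptions, not values recommended by this integration.

```python
# Minimal sketch: build Datadog metric monitor query strings for OCI GPU metrics.
# threshold_query is a hypothetical helper; thresholds and windows are assumptions.

NAMESPACE = "oci.gpu_infrastructure_health"


def threshold_query(metric, threshold, window="last_5m", agg="avg",
                    scope="*", as_count=False):
    """Return a monitor query that triggers when the metric exceeds threshold.

    metric    -- metric name without the namespace prefix
    threshold -- numeric alert threshold (assumed value, tune for your fleet)
    window    -- evaluation window, for example "last_5m" or "last_1h"
    agg       -- aggregation used for both time and space ("avg", "sum", ...)
    scope     -- tag scope; "*" matches every reporting resource
    as_count  -- append .as_count() for count-type metrics such as ECC errors
    """
    suffix = ".as_count()" if as_count else ""
    return f"{agg}({window}):{agg}:{NAMESPACE}.{metric}{{{scope}}}{suffix} > {threshold}"


# Gauge example: average GPU utilization above an assumed 90 percent.
utilization_query = threshold_query("gpu_utilization", 90)

# Count example: any double-bit ECC error during the last hour.
ecc_query = threshold_query(
    "gpu_ecc_double_bit_errors", 0, window="last_1h", agg="sum", as_count=True
)
```

You could paste the resulting strings into a metric monitor definition, narrowing the scope to the tags your OCI resources actually report.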
OCI GPU does not include any service checks.
OCI GPU does not include any events.
Need help? Contact Datadog support.