Join the Preview!
GPU Monitoring is in Preview. To join the preview, click Request Access and complete the form.
Overview
Datadog’s GPU Monitoring provides a centralized view into GPU fleet health, cost, and performance. GPU Monitoring enables teams to make better provisioning decisions, troubleshoot failed workloads, and eliminate idle GPU costs without having to manually set up individual vendor tools (like NVIDIA’s DCGM). You can access insights into your GPU fleet by deploying the Datadog Agent.
For setup instructions, see Set up GPU Monitoring.
Make data-driven GPU allocation and provisioning decisions
With visibility into GPU utilization by host, node, or pod, you can identify hotspots or underutilization of expensive GPU infrastructure.
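As a rough illustration of what identifying hotspots and underutilization can mean in practice, the following minimal sketch classifies hosts by their average GPU utilization. The sample data, the thresholds, and the `classify_hosts` helper are hypothetical and are not part of Datadog's product or API; in a real setup the values would come from the telemetry collected by the Datadog Agent.

```python
from statistics import mean

# Hypothetical GPU utilization samples (percent), keyed by host name.
samples = {
    "gpu-host-a": [92.0, 97.5, 95.0],
    "gpu-host-b": [3.0, 5.5, 2.0],
    "gpu-host-c": [48.0, 55.0, 61.0],
}


def classify_hosts(samples, idle_below=10.0, hot_above=90.0):
    """Label each host by its mean utilization.

    The thresholds are illustrative assumptions, not recommended values.
    """
    labels = {}
    for host, values in samples.items():
        if not values:
            # No samples were reported for this host.
            labels[host] = "no-data"
            continue
        avg = mean(values)
        if avg < idle_below:
            labels[host] = "underutilized"
        elif avg > hot_above:
            labels[host] = "hotspot"
        else:
            labels[host] = "normal"
    return labels


if __name__ == "__main__":
    for host, label in classify_hosts(samples).items():
        print(f"{host}: {label}")
```

The same grouping could be applied per node or per pod by changing the key used in the `samples` mapping.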
Troubleshoot failed workloads due to resource contention
Understand your current device availability and forecast how many devices specific teams or workloads need, so that workloads do not fail because of resource contention.
Identify and eliminate wasted, idle GPU costs
Identify total spend on GPU infrastructure and attribute those costs to specific workloads and instances. Directly correlate GPU usage to related pods or processes.
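One simple way to think about cost attribution is to split total spend in proportion to the GPU-hours each workload consumed. The sketch below assumes this proportional model; the `attribute_cost` function, the workload names, and the figures are hypothetical and do not describe how Datadog computes costs.

```python
def attribute_cost(total_cost, gpu_hours):
    """Split total_cost across workloads in proportion to GPU-hours used.

    gpu_hours maps a workload name to the GPU-hours it consumed.
    Idle time can be tracked as its own entry so that waste is visible.
    """
    total_hours = sum(gpu_hours.values())
    if total_hours == 0:
        # Nothing was used, so there is nothing to attribute.
        return {workload: 0.0 for workload in gpu_hours}
    return {
        workload: total_cost * hours / total_hours
        for workload, hours in gpu_hours.items()
    }


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    usage = {"training": 60.0, "inference": 30.0, "idle": 10.0}
    for workload, cost in attribute_cost(1000.0, usage).items():
        print(f"{workload}: {cost:.2f}")
```

Keeping idle time as a separate entry makes the share of spend that produced no work explicit, which is the portion this section is concerned with eliminating.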
With GPU Monitoring’s resource telemetry, you can analyze trends in GPU resources and metrics (including GPU utilization, power, and memory) over time, helping you understand their effects on your model and application performance.
Ready to start?
See Set up GPU Monitoring for instructions on how to set up Datadog’s GPU Monitoring.