Ray

Supported OS Linux Windows Mac OS

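Integration version 2.2.0

Overview

This check monitors Ray through the Datadog Agent. Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads, from reinforcement learning to deep learning to tuning and model serving.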

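Setup

Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.

Installation

Starting with Agent release 7.49.0, the Ray check is included in the Datadog Agent package. No additional installation is needed on your server.

Note: This check uses OpenMetrics to collect metrics from the OpenMetrics endpoint that Ray can expose, which requires Python 3.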

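Configuration

Host

Metric collection
  1. Edit the ray.d/conf.yaml file, in the conf.d/ folder at the root of your Agent's configuration directory, to start collecting your Ray performance data. See the sample configuration file for all available configuration options.

    This example demonstrates the configuration: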

    init_config:
      ...
    instances:
      - openmetrics_endpoint: http://<RAY_ADDRESS>:8080
    
  2. Restart the Agent after modifying the configuration.

Docker

Metric collection

This example demonstrates the configuration as a Docker label inside docker-compose.yml. See the sample configuration file for all available configuration options.

labels:
  com.datadoghq.ad.checks: '{"ray":{"instances":[{"openmetrics_endpoint":"http://%%host%%:8080"}]}}'

Kubernetes

Metric collection

This example demonstrates the configuration as Kubernetes annotations on your Ray pods. See the sample configuration file for all available configuration options.

apiVersion: v1
kind: Pod
metadata:
  name: '<POD_NAME>'
  annotations:
    ad.datadoghq.com/ray.checks: |-
      {
        "ray": {
          "instances": [
            {
              "openmetrics_endpoint": "http://%%host%%:8080"
            }
          ]
        }
      }
    # (...)
spec:
  containers:
    - name: 'ray'
# (...)
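
The Docker label and the Kubernetes annotation above carry the same JSON payload, so it can help to generate that string rather than hand-write it. The following is a minimal sketch, not part of the integration; the function name `build_checks_payload` is hypothetical, and the endpoint is the `%%host%%` template value used in the examples above.

```python
import json


def build_checks_payload(endpoint):
    """Return the JSON string that configures one Ray check instance.

    `endpoint` is the OpenMetrics URL, for example the Autodiscovery
    template "http://%%host%%:8080" shown in the examples above.
    """
    config = {"ray": {"instances": [{"openmetrics_endpoint": endpoint}]}}
    # Compact separators reproduce the single-line form used in the Docker label.
    return json.dumps(config, separators=(",", ":"))


payload = build_checks_payload("http://%%host%%:8080")
print(payload)
```

The printed string can be pasted as the value of the `com.datadoghq.ad.checks` label or the `ad.datadoghq.com/ray.checks` annotation.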

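Ray metrics are available on the OpenMetrics endpoint. Ray also lets you export custom application-level metrics. You can configure the Ray integration to collect these metrics with the extra_metrics option. All Ray metrics, including your custom metrics, use the ray. prefix.

Note: Custom Ray metrics are considered standard metrics in Datadog.

This example demonstrates a configuration that uses the extra_metrics option: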

init_config:
  ...
instances:
  - openmetrics_endpoint: http://<RAY_ADDRESS>:8080
    # Also collect your own custom Ray metrics
    extra_metrics:
      - my_custom_ray_metric

For more details on how to configure this option, see the sample ray.d/conf.yaml configuration file.
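
Before filling in extra_metrics, it can be useful to see which metric names an endpoint actually exposes. The sketch below extracts sample names from Prometheus/OpenMetrics text. The helper `metric_names` and the `SAMPLE` payload are illustrative assumptions, not real Ray output; check the names against your own endpoint.

```python
def metric_names(exposition_text):
    """Return the sorted, unique sample names found in exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and metadata lines such as "# HELP" and "# TYPE".
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split()[0])
    return sorted(names)


# Hypothetical payload, for illustration only.
SAMPLE = """\
# HELP ray_tasks Current number of tasks
# TYPE ray_tasks gauge
ray_tasks{State="RUNNING"} 3
my_custom_ray_metric 7
"""

print(metric_names(SAMPLE))
```

Names that the integration does not already collect are the candidates for the extra_metrics list.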

Validation

Run the Agent's status subcommand and look for ray under the Checks section.

Data Collected

Metrics

ray.actors
(gauge)
Number of actors currently in a particular state.
ray.cluster.active_nodes
(gauge)
Active nodes on the cluster
Shown as node
ray.cluster.failed_nodes
(gauge)
Failed nodes on the cluster
Shown as node
ray.cluster.pending_nodes
(gauge)
Pending nodes on the cluster
Shown as node
ray.component.cpu_percentage
(gauge)
Total CPU usage of the components on a node.
Shown as percent
ray.component.mem_shared
(gauge)
SHM usage of all components of the node. It is equivalent to the top command’s SHR column.
Shown as byte
ray.component.rss
(gauge)
RSS usage of all components on the node.
Shown as megabyte
ray.component.uss
(gauge)
USS usage of all components on the node.
Shown as megabyte
ray.gcs.actors
(gauge)
Number of actors per state {Created, Destroyed, Unresolved, Pending}
ray.gcs.placement_group
(gauge)
Number of placement groups broken down by state in {Registered, Pending, Infeasible}
ray.gcs.storage_operation.count
(count)
Number of operations invoked on Gcs storage
ray.gcs.storage_operation.latency.bucket
(count)
Time to invoke an operation on Gcs storage
Shown as millisecond
ray.gcs.storage_operation.latency.count
(count)
Time to invoke an operation on Gcs storage
ray.gcs.storage_operation.latency.sum
(count)
Time to invoke an operation on Gcs storage
Shown as millisecond
ray.gcs.task_manager.task_events.dropped
(gauge)
Number of task events dropped per type {PROFILE_EVENT, STATUS_EVENT}
Shown as event
ray.gcs.task_manager.task_events.reported
(gauge)
Number of all task events reported to gcs.
Shown as event
ray.gcs.task_manager.task_events.stored
(gauge)
Number of task events stored in GCS.
Shown as event
ray.gcs.task_manager.task_events.stored_bytes
(gauge)
Number of bytes of all task events stored in GCS.
Shown as byte
ray.grpc_server.req.finished.count
(count)
Number of finished requests in the gRPC server
Shown as request
ray.grpc_server.req.handling.count
(count)
Number of requests being handled in the gRPC server
Shown as request
ray.grpc_server.req.new.count
(count)
Number of new requests in the gRPC server
Shown as request
ray.grpc_server.req.process_time
(gauge)
Request latency in the gRPC server
Shown as millisecond
ray.health_check.rpc_latency.bucket
(count)
Latency of rpc request for health check.
Shown as millisecond
ray.health_check.rpc_latency.count
(count)
Latency of rpc request for health check.
ray.health_check.rpc_latency.sum
(count)
Latency of rpc request for health check.
Shown as millisecond
ray.internal_num.infeasible_scheduling_classes
(gauge)
The number of unique scheduling classes that are infeasible.
ray.internal_num.processes.skipped.job_mismatch
(gauge)
The total number of cached workers skipped due to job mismatch.
Shown as process
ray.internal_num.processes.skipped.runtime_environment_mismatch
(gauge)
The total number of cached workers skipped due to runtime environment mismatch.
Shown as process
ray.internal_num.processes.started
(gauge)
The total number of worker processes the worker pool has created.
Shown as process
ray.internal_num.processes.started.from_cache
(gauge)
The total number of workers started from a cached worker process.
Shown as process
ray.internal_num.spilled_tasks
(gauge)
The cumulative number of lease requests that this raylet has spilled to other raylets.
Shown as request
ray.memory_manager.worker_eviction
(count)
The number of tasks and actors killed by the Ray Out of Memory killer broken down by types (whether it is tasks or actors) and names (name of tasks and actors).
ray.node.cpu
(gauge)
Total CPUs available on a ray node
ray.node.cpu_utilization
(gauge)
Total CPU usage on a ray node
ray.node.disk.free
(gauge)
Total disk free (bytes) on a ray node
Shown as byte
ray.node.disk.io.read
(gauge)
Total read from disk
ray.node.disk.io.read.count
(gauge)
Total read ops from disk
Shown as operation
ray.node.disk.io.read.speed
(gauge)
Disk read speed
ray.node.disk.io.write
(gauge)
Total written to disk
ray.node.disk.io.write.count
(gauge)
Total write ops to disk
ray.node.disk.io.write.speed
(gauge)
Disk write speed
ray.node.disk.read.iops
(gauge)
Disk read iops
ray.node.disk.usage
(gauge)
Total disk usage (bytes) on a ray node
Shown as byte
ray.node.disk.utilization
(gauge)
Total disk utilization (percentage) on a ray node
Shown as percent
ray.node.disk.write.iops
(gauge)
Disk write iops
ray.node.gpus_utilization
(gauge)
The GPU utilization per GPU as a percentage quantity (0..NGPU*100). GpuDeviceName is a name of a GPU device (e.g., Nvidia A10G) and GpuIndex is the index of the GPU.
Shown as percent
ray.node.gram_used
(gauge)
The amount of GPU memory used per GPU, in bytes.
Shown as byte
ray.node.mem.available
(gauge)
Memory available on a ray node
Shown as byte
ray.node.mem.shared
(gauge)
Total shared memory usage on a ray node
Shown as byte
ray.node.mem.total
(gauge)
Total memory on a ray node
Shown as byte
ray.node.mem.used
(gauge)
Memory usage on a ray node
Shown as byte
ray.node.network.receive.speed
(gauge)
Network receive speed
ray.node.network.received
(gauge)
Total network received
ray.node.network.send.speed
(gauge)
Network send speed
ray.node.network.sent
(gauge)
Total network sent
ray.object_directory.added_locations
(gauge)
Number of object locations added per second. If this is high, a lot of objects have been added on this node.
ray.object_directory.lookups
(gauge)
Number of object location lookups per second. If this is high, the raylet is waiting on a lot of objects.
ray.object_directory.removed_locations
(gauge)
Number of object locations removed per second. If this is high, a lot of objects have been removed from this node.
ray.object_directory.subscriptions
(gauge)
Number of object location subscriptions. If this is high, the raylet is attempting to pull a lot of objects.
ray.object_directory.updates
(gauge)
Number of object location updates per second. If this is high, the raylet is attempting to pull a lot of objects and/or the locations for objects are frequently changing (e.g. due to many object copies or evictions).
Shown as update
ray.object_manager.bytes
(gauge)
Number of bytes pushed or received by type {PushedFromLocalPlasma, PushedFromLocalDisk, Received}.
Shown as byte
ray.object_manager.num_pull_requests
(gauge)
Number of active pull requests for objects.
ray.object_manager.received_chunks
(gauge)
Number of object chunks received broken per type {Total, FailedTotal, FailedCancelled, FailedPlasmaFull}.
ray.object_store.available_memory
(gauge)
Amount of memory currently available in the object store.
Shown as byte
ray.object_store.fallback_memory
(gauge)
Amount of memory in fallback allocations in the filesystem.
Shown as byte
ray.object_store.memory
(gauge)
Object store memory by various sub-kinds on this node
Shown as byte
ray.object_store.num_local_objects
(gauge)
Number of objects currently in the object store.
Shown as object
ray.object_store.size.bucket
(count)
The distribution of object size in bytes
Shown as byte
ray.object_store.size.count
(count)
The distribution of object size in bytes
ray.object_store.size.sum
(count)
The distribution of object size in bytes
Shown as byte
ray.object_store.used_memory
(gauge)
Amount of memory currently occupied in the object store.
Shown as byte
ray.placement_groups
(gauge)
Current number of placement groups by state. The State label (e.g., PENDING, CREATED, REMOVED) describes the state of the placement group.
ray.process.cpu_seconds.count
(count)
Total user and system CPU time spent in seconds.
Shown as second
ray.process.max_fds
(gauge)
Maximum number of open file descriptors.
Shown as file
ray.process.open_fds
(gauge)
Number of open file descriptors.
Shown as file
ray.process.resident_memory
(gauge)
Resident memory size in bytes.
Shown as byte
ray.process.start_time
(gauge)
Start time of the process since unix epoch in seconds.
Shown as second
ray.process.virtual_memory
(gauge)
Virtual memory size in bytes.
Shown as byte
ray.pull_manager.active_bundles
(gauge)
Number of active bundle requests
Shown as request
ray.pull_manager.num_object_pins
(gauge)
Number of object pin attempts by the pull manager, can be {Success, Failure}.
Shown as attempt
ray.pull_manager.object_request_time.bucket
(count)
Time between initial object pull request and local pinning of the object.
Shown as millisecond
ray.pull_manager.object_request_time.count
(count)
Time between initial object pull request and local pinning of the object.
ray.pull_manager.object_request_time.sum
(count)
Time between initial object pull request and local pinning of the object.
Shown as millisecond
ray.pull_manager.requested_bundles
(gauge)
Number of requested bundles broken per type {Get, Wait, TaskArgs}.
ray.pull_manager.requests
(gauge)
Number of pull requests broken per type {Queued, Active, Pinned}.
Shown as request
ray.pull_manager.retries_total
(gauge)
Number of cumulative pull retries.
ray.pull_manager.usage
(gauge)
The total number of bytes usage broken per type {Available, BeingPulled, Pinned}
Shown as byte
ray.push_manager.chunks
(gauge)
Number of object chunks transfer broken per type {InFlight, Remaining}.
ray.push_manager.in_flight_pushes
(gauge)
Number of in flight object push requests.
Shown as request
ray.python.gc.collections.count
(count)
Number of times this generation was collected
ray.python.gc.objects_collected.count
(count)
Objects collected during gc
Shown as object
ray.python.gc.objects_uncollectable.count
(count)
Uncollectable objects found during GC
Shown as object
ray.resources
(gauge)
Logical Ray resources broken per state {AVAILABLE, USED}
Shown as resource
ray.scheduler.failed_worker_startup
(gauge)
Number of tasks that fail to be scheduled because workers were not available. Labels are broken per reason {JobConfigMissing, RegistrationTimedOut, RateLimited}
Shown as task
ray.scheduler.placement_time.bucket
(count)
The time it takes for a workload (task, actor, placement group) to be placed. This is the time from when the task's dependencies are resolved to when it actually reserves resources on a node to run.
Shown as second
ray.scheduler.placement_time.count
(count)
The time it takes for a workload (task, actor, placement group) to be placed. This is the time from when the task's dependencies are resolved to when it actually reserves resources on a node to run.
ray.scheduler.placement_time.sum
(count)
The time it takes for a workload (task, actor, placement group) to be placed. This is the time from when the task's dependencies are resolved to when it actually reserves resources on a node to run.
Shown as second
ray.scheduler.tasks
(gauge)
Number of tasks waiting for scheduling broken per state {Cancelled, Executing, Waiting, Dispatched, Received}.
Shown as task
ray.scheduler.unscheduleable_tasks
(gauge)
Number of pending tasks (not scheduleable tasks) broken per reason {Infeasible, WaitingForResources, WaitingForPlasmaMemory, WaitingForRemoteResources, WaitingForWorkers}.
Shown as task
ray.serve.deployment.error
(gauge)
The number of exceptions that have occurred in this replica.
Shown as exception
ray.serve.deployment.processing_latency.bucket
(count)
The latency for queries to be processed.
Shown as millisecond
ray.serve.deployment.processing_latency.count
(count)
The latency for queries to be processed.
ray.serve.deployment.processing_latency.sum
(count)
The latency for queries to be processed.
Shown as millisecond
ray.serve.deployment.queued_queries
(gauge)
The current number of queries to this deployment waiting to be assigned to a replica.
Shown as query
ray.serve.deployment.replica.healthy
(gauge)
Tracks whether this deployment replica is healthy. 1 means healthy, 0 means unhealthy.
ray.serve.deployment.replica.starts
(gauge)
The number of times this replica has been restarted due to failure.
ray.serve.deployment.request.counter
(gauge)
The number of queries that have been processed in this replica.
Shown as query
ray.serve.grpc_request_latency.bucket
(count)
The end-to-end latency of GRPC requests (measured from the Serve GRPC proxy).
ray.serve.grpc_request_latency.count
(count)
The end-to-end latency of GRPC requests (measured from the Serve GRPC proxy).
ray.serve.grpc_request_latency.sum
(count)
The end-to-end latency of GRPC requests (measured from the Serve GRPC proxy).
ray.serve.handle_request
(gauge)
The number of handle.remote() calls that have been made on this handle.
Shown as request
ray.serve.http_request_latency.bucket
(count)
The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy).
Shown as millisecond
ray.serve.http_request_latency.count
(count)
The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy).
ray.serve.http_request_latency.sum
(count)
The end-to-end latency of HTTP requests (measured from the Serve HTTP proxy).
Shown as millisecond
ray.serve.multiplexed_get_model_requests.count
(count)
The counter for get model requests on the current replica.
ray.serve.multiplexed_model_load_latency.bucket
(count)
The time it takes to load a model.
Shown as millisecond
ray.serve.multiplexed_model_load_latency.count
(count)
The time it takes to load a model.
ray.serve.multiplexed_model_load_latency.sum
(count)
The time it takes to load a model.
Shown as millisecond
ray.serve.multiplexed_model_unload_latency.bucket
(count)
The time it takes to unload a model.
Shown as millisecond
ray.serve.multiplexed_model_unload_latency.count
(count)
The time it takes to unload a model.
ray.serve.multiplexed_model_unload_latency.sum
(count)
The time it takes to unload a model.
Shown as millisecond
ray.serve.multiplexed_models_load.count
(count)
The counter for loaded models on the current replica.
ray.serve.multiplexed_models_unload.count
(count)
The counter for unloaded models on the current replica.
ray.serve.num_deployment_grpc_error_requests
(gauge)
The number of errored GRPC responses returned by each deployment.
ray.serve.num_deployment_http_error_requests
(gauge)
The number of non-200 HTTP responses returned by each deployment.
Shown as response
ray.serve.num_grpc_error_requests
(gauge)
The number of errored GRPC responses.
ray.serve.num_grpc_requests
(gauge)
The number of GRPC responses.
ray.serve.num_http_error_requests
(gauge)
The number of non-200 HTTP responses.
Shown as response
ray.serve.num_http_requests
(gauge)
The number of HTTP requests processed.
Shown as request
ray.serve.num_multiplexed_models
(gauge)
The number of models loaded on the current replica.
ray.serve.num_router_requests
(gauge)
The number of requests processed by the router.
Shown as request
ray.serve.registered_multiplexed_model_id
(gauge)
The model id registered on the current replica.
ray.serve.replica.pending_queries
(gauge)
The current number of pending queries.
Shown as query
ray.serve.replica.processing_queries
(gauge)
The current number of queries being processed.
Shown as query
ray.server.num_ongoing_grpc_requests
(gauge)
The number of ongoing requests in this GRPC proxy.
ray.server.num_ongoing_http_requests
(gauge)
The number of ongoing requests in this HTTP proxy.
ray.server.num_scheduling_tasks
(gauge)
The number of request scheduling tasks in the router.
ray.server.num_scheduling_tasks_in_backoff
(gauge)
The number of request scheduling tasks in the router that are undergoing backoff.
ray.spill_manager.objects
(gauge)
Number of local objects broken per state {Pinned, PendingRestore, PendingSpill}.
Shown as object
ray.spill_manager.objects_size
(gauge)
Byte size of local objects broken per state {Pinned, PendingSpill}.
Shown as byte
ray.spill_manager.request_total
(gauge)
Number of {spill, restore} requests.
Shown as request
ray.tasks
(gauge)
Number of tasks currently in a particular state.
Shown as task
ray.unintentional_worker_failures.count
(count)
Number of worker failures that are not intentional. For example, worker failures due to system related errors.
Shown as error
ray.worker.register_time.bucket
(count)
End-to-end latency of registering a worker process.
Shown as millisecond
ray.worker.register_time.count
(count)
End-to-end latency of registering a worker process.
ray.worker.register_time.sum
(count)
End-to-end latency of registering a worker process.
Shown as millisecond
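
Several metrics in the list above come as .bucket, .count, and .sum triples, which is how histogram data is reported. Dividing the increase in .sum by the increase in .count over the same interval gives the mean observation for that interval. The sketch below illustrates the arithmetic; the function name and the numbers are hypothetical.

```python
def mean_from_histogram(sum_delta, count_delta):
    """Mean observation over an interval, or None if nothing was observed.

    sum_delta:   increase of the `.sum` metric over the interval
    count_delta: increase of the `.count` metric over the same interval
    """
    if count_delta <= 0:
        return None
    return sum_delta / count_delta


# Hypothetical interval: 50 requests totalling 1250 ms of latency,
# for example from ray.serve.http_request_latency.sum and .count.
print(mean_from_histogram(1250.0, 50))  # 25.0
```

Returning None for an empty interval avoids a division by zero and keeps "no traffic" distinct from "zero latency".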

Events

The Ray integration does not include any events.

Service Checks

ray.openmetrics.health

Returns CRITICAL if the check cannot access the openmetrics metrics endpoint of Ray.

Statuses: ok, critical
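
To reason about when this service check might turn critical, it can help to probe the endpoint yourself. The sketch below is not the integration's implementation. It is a stand-alone approximation that assumes an HTTP 200 response means the endpoint is accessible; `probe_endpoint` is a hypothetical helper.

```python
import http.client
import urllib.request


def probe_endpoint(url, timeout=2.0):
    """Return "ok" if the URL answers with HTTP 200, otherwise "critical"."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "ok" if response.status == 200 else "critical"
    except (OSError, ValueError, http.client.HTTPException):
        # OSError covers URLError, HTTPError, timeouts, and refused connections.
        # ValueError covers malformed URLs.
        return "critical"


# A malformed URL can never be reached, so this prints "critical".
print(probe_endpoint("not-a-url"))
```

Calling `probe_endpoint` with the same URL you set in openmetrics_endpoint, from the host that runs the Agent, gives a quick first check of network reachability.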

Logs

The Ray integration can collect logs from the Ray service and forward them to Datadog.

  1. Collecting logs is disabled by default in the Datadog Agent. Enable it in your datadog.yaml file:

    logs_enabled: true
    
  2. Uncomment and edit the logs configuration block in your ray.d/conf.yaml file. For example:

    logs:
      - type: file
        path: /tmp/ray/session_latest/logs/dashboard.log
        source: ray
        service: ray
      - type: file
        path: /tmp/ray/session_latest/logs/gcs_server.out
        source: ray
        service: ray
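
The log entries above differ only in the file name, so when more Ray log files are needed the list can be generated. This is a minimal sketch under that assumption; `ray_log_entries` is a hypothetical helper, and the directory and file names are the ones from the example above.

```python
import json
from pathlib import PurePosixPath

# Log directory taken from the example configuration above.
LOG_DIR = PurePosixPath("/tmp/ray/session_latest/logs")


def ray_log_entries(filenames, source="ray", service="ray"):
    """Build one file-type log entry per file name."""
    return [
        {
            "type": "file",
            "path": str(LOG_DIR / name),
            "source": source,
            "service": service,
        }
        for name in filenames
    ]


entries = ray_log_entries(["dashboard.log", "gcs_server.out"])
# Print the structure for review before copying it into ray.d/conf.yaml.
print(json.dumps({"logs": entries}, indent=2))
```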
    

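Collecting logs is disabled by default in the Datadog Agent. To enable it, see Kubernetes Log Collection.

Then, set Log Integrations as pod annotations. This can also be configured with a file, a ConfigMap, or a key-value store. For more information, see the configuration section of Kubernetes Log Collection.

Annotations v1/v2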

apiVersion: v1
kind: Pod
metadata:
  name: ray
  annotations:
    ad.datadoghq.com/ray.logs: '[{"source":"ray","service":"ray"}]'
spec:
  containers:
    - name: ray

To learn more about the logging configuration with Ray and all of its log files, see the official Ray documentation.

Troubleshooting

Need help? Contact Datadog support.