vLLM


Supported OS: Linux, Windows, macOS

Integration version: 2.3.0

Overview

This check monitors vLLM through the Datadog Agent.

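Setup

Follow the instructions below to install this check and configure it for an Agent running on a host.

Installation

The vLLM check is included in the Datadog Agent package. No additional installation is needed on your server.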

Configuration

To start collecting vllm performance data, edit the vllm.d/conf.yaml file in the conf.d/ folder at the root of the Agent's configuration directory. See the sample vllm.d/conf.yaml for all available configuration options.
Restart the Agent.
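
As a rough illustration of what a minimal vllm.d/conf.yaml might contain, the following sketch renders one instance entry as text. The `openmetrics_endpoint` option name and the `http://localhost:8000/metrics` URL are assumptions here (vLLM commonly exposes Prometheus metrics at that address); confirm both against the sample vllm.d/conf.yaml before use. The helper name `render_vllm_conf` is hypothetical.

```python
def render_vllm_conf(endpoint: str = "http://localhost:8000/metrics") -> str:
    """Return the text of a minimal vllm.d/conf.yaml with a single instance.

    Assumption: the check reads its scrape target from an
    `openmetrics_endpoint` option, and vLLM serves metrics at `endpoint`.
    """
    lines = [
        "init_config:",
        "",
        "instances:",
        f"  - openmetrics_endpoint: {endpoint}",
    ]
    return "\n".join(lines) + "\n"


# Print the rendered file so it can be reviewed before saving it.
print(render_vllm_conf(), end="")
```

Generating the file from a function like this can help keep the endpoint consistent across hosts, but editing conf.yaml by hand works just as well.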

Validation

Run the Agent's status subcommand and look for vllm under the Checks section.
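
The validation step can also be scripted. The sketch below scans status output for a check entry named vllm. It assumes the Agent CLI is available as `datadog-agent` on the PATH and that each check appears on its own line as a name followed by a version in parentheses; both are assumptions, and the sample text in the usage lines is hypothetical and abbreviated.

```python
import re
import subprocess


def read_agent_status() -> str:
    """Run the Agent's status subcommand and return its output.

    Assumption: the CLI is invoked as `datadog-agent status`; it may need
    elevated privileges on some hosts.
    """
    result = subprocess.run(
        ["datadog-agent", "status"], capture_output=True, text=True, check=False
    )
    return result.stdout


def check_is_listed(status_text: str, check_name: str = "vllm") -> bool:
    """Return True if a line starts with the check name, optionally followed by a version."""
    pattern = rf"^\s*{re.escape(check_name)}(\s*\(|\s*$)"
    return re.search(pattern, status_text, flags=re.MULTILINE) is not None


# Hypothetical, abbreviated status output used only to exercise the helper.
sample = """
  Running Checks
  ==============

    vllm (2.3.0)
    ------------
      Instance ID: vllm:abc123 [OK]
"""
print(check_is_listed(sample))
```

On a real host, `check_is_listed(read_agent_status())` would apply the same test to live output.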

Data Collected

Metrics


vllm.avg.generation_throughput.toks_per_s (gauge)	Average generation throughput in tokens/s
vllm.avg.prompt.throughput.toks_per_s (gauge)	Average prefill throughput in tokens/s
vllm.cache_config_info (gauge)	Information on cache config
vllm.cpu_cache_usage_perc (gauge)	CPU KV-cache usage. 1 means 100 percent usage. Shown as percent
vllm.e2e_request_latency.seconds.bucket (count)	The observations of end to end request latency bucketed by seconds.
vllm.e2e_request_latency.seconds.count (count)	The total number of observations of end to end request latency.
vllm.e2e_request_latency.seconds.sum (count)	The sum of end to end request latency in seconds. Shown as second
vllm.generation_tokens.count (count)	Number of generation tokens processed.
vllm.gpu_cache_usage_perc (gauge)	GPU KV-cache usage. 1 means 100 percent usage. Shown as percent
vllm.num_preemptions.count (count)	Cumulative number of preemptions from the engine.
vllm.num_requests.running (gauge)	Number of requests currently running on GPU.
vllm.num_requests.swapped (gauge)	Number of requests swapped to CPU.
vllm.num_requests.waiting (gauge)	Number of requests waiting.
vllm.process.cpu_seconds.count (count)	Total user and system CPU time spent in seconds. Shown as second
vllm.process.max_fds (gauge)	Maximum number of open file descriptors. Shown as file
vllm.process.open_fds (gauge)	Number of open file descriptors. Shown as file
vllm.process.resident_memory_bytes (gauge)	Resident memory size in bytes. Shown as byte
vllm.process.start_time_seconds (gauge)	Start time of the process since unix epoch in seconds. Shown as second
vllm.process.virtual_memory_bytes (gauge)	Virtual memory size in bytes. Shown as byte
vllm.prompt_tokens.count (count)	Number of prefill tokens processed.
vllm.python.gc.collections.count (count)	Number of times this generation was collected
vllm.python.gc.objects.collected.count (count)	Objects collected during gc
vllm.python.gc.objects.uncollectable.count (count)	Uncollectable objects found during GC
vllm.python.info (gauge)	Python platform information
vllm.request.generation_tokens.bucket (count)	Number of generation tokens processed.
vllm.request.generation_tokens.count (count)	Number of generation tokens processed.
vllm.request.generation_tokens.sum (count)	Number of generation tokens processed.
vllm.request.params.best_of.bucket (count)	Histogram of the best_of request parameter.
vllm.request.params.best_of.count (count)	Histogram of the best_of request parameter.
vllm.request.params.best_of.sum (count)	Histogram of the best_of request parameter.
vllm.request.params.n.bucket (count)	Histogram of the n request parameter.
vllm.request.params.n.count (count)	Histogram of the n request parameter.
vllm.request.params.n.sum (count)	Histogram of the n request parameter.
vllm.request.prompt_tokens.bucket (count)	Number of prefill tokens processed.
vllm.request.prompt_tokens.count (count)	Number of prefill tokens processed.
vllm.request.prompt_tokens.sum (count)	Number of prefill tokens processed.
vllm.request.success.count (count)	Count of successfully processed requests.
vllm.time_per_output_token.seconds.bucket (count)	The observations of time per output token bucketed by seconds.
vllm.time_per_output_token.seconds.count (count)	The total number of observations of time per output token.
vllm.time_per_output_token.seconds.sum (count)	The sum of time per output token in seconds. Shown as second
vllm.time_to_first_token.seconds.bucket (count)	The observations of time to first token bucketed by seconds.
vllm.time_to_first_token.seconds.count (count)	The total number of observations of time to first token.
vllm.time_to_first_token.seconds.sum (count)	The sum of time to first token in seconds. Shown as second
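
The histogram-style metrics above (`.sum`, `.count`, `.bucket`) are usually combined rather than read on their own. The sketch below shows two common derivations: a mean latency from the `.sum` and `.count` increases over the same interval, and an approximate quantile from cumulative bucket counts using linear interpolation. It assumes buckets are cumulative and keyed by upper bound, as in the Prometheus exposition format; the function names and sample numbers are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def mean_latency_seconds(sum_delta: float, count_delta: float) -> Optional[float]:
    """Mean latency over an interval: increase in .sum divided by increase in .count."""
    if count_delta <= 0:
        return None  # no observations in the interval
    return sum_delta / count_delta


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> Optional[float]:
    """Approximate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound, with the last bound set to infinity.
    Values are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total <= 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate inside the open-ended bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical numbers: 12.5 s of total latency across 50 requests.
print(mean_latency_seconds(12.5, 50))  # 0.25

# Hypothetical cumulative buckets for vllm.e2e_request_latency.seconds.bucket.
buckets = [(0.5, 10), (1.0, 30), (2.0, 40), (math.inf, 40)]
print(estimate_quantile(0.5, buckets))  # 0.75
```

The quantile is only as precise as the bucket boundaries allow, so treat it as an estimate rather than an exact percentile.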

Events

The vLLM integration does not include any events.

Service Checks

The vLLM integration includes the following service check.

vllm.openmetrics.health

Returns CRITICAL if the Agent is unable to connect to the vLLM OpenMetrics endpoint, otherwise returns OK.

Statuses: ok, critical

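Logs

Log collection is disabled by default in the Datadog Agent. If you are running the Agent as a container, see the container installation documentation to enable log collection. If you are running a host Agent, see the host Agent documentation instead. In either case, make sure that the source value for your logs is vllm. This setting ensures that the built-in processing pipeline finds your logs. To set the log configuration for your container, see the log integrations documentation.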

Troubleshooting

Need help? Contact Datadog support.

Further Reading

Additional helpful documentation, links, and articles:

Optimize LLM application performance with Datadog's vLLM integration