# Supported Model Servers
Any model server that conforms to the model server protocol is supported by the inference extension.
## Compatible Model Server Versions
| Model Server | Version | Commit | Notes |
|---|---|---|---|
| vLLM V0 | v0.6.4 and above | commit 0ad216f | |
| vLLM V1 | v0.8.0 and above | commit bc32bc7 | |
| Triton (TensorRT-LLM) | 25.03 and above | commit 15cb989 | The LoRA affinity feature is not available because the required LoRA metrics have not been implemented in Triton yet (see the feature request). |
## vLLM
vLLM is configured as the default in the endpoint picker extension (EPP), so no further configuration is required.
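For reference, the built-in defaults correspond roughly to the following flag values (a sketch only; the metric names below are assumptions based on vLLM's standard Prometheus metric names and are not taken from this guide):

```yaml
# Sketch of the implicit vLLM defaults (metric names are assumptions)
- -totalQueuedRequestsMetric
- "vllm:num_requests_waiting"
- -kvCacheUsagePercentageMetric
- "vllm:gpu_cache_usage_perc"
- -loraInfoMetric
- "vllm:lora_requests_info"
```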
## Triton with TensorRT-LLM Backend
Triton-specific metric names need to be specified when starting the EPP.
### Option 1: Use Helm
Pass `--set inferencePool.modelServerType=triton-tensorrt-llm` when installing the InferencePool via Helm. See the InferencePool Helm guide for more details.
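Equivalently, the same setting can be placed in a Helm values file (a minimal sketch; the file name and any other values in your chart configuration are up to you):

```yaml
# values.yaml (sketch): equivalent to passing --set on the command line
inferencePool:
  modelServerType: triton-tensorrt-llm
```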
### Option 2: Edit the EPP deployment YAML
Add the following to the `args` of the EPP deployment:
```yaml
- -totalQueuedRequestsMetric
- "nv_trt_llm_request_metrics{request_type=waiting}"
- -kvCacheUsagePercentageMetric
- "nv_trt_llm_kv_cache_block_metrics{kv_cache_block_type=fraction}"
- -loraInfoMetric
- "" # An empty metric name disables LoRA metric scraping, as LoRA metrics are not supported by Triton yet.
```