v0.5.1
Release date: 2024-07-06 03:47:01
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
Highlights
- vLLM now has pipeline parallelism! (#4412, #5408, #6115, #6120). You can now run the API server with `--pipeline-parallel-size`. This feature is at an early stage; please let us know your feedback. A launch-and-query sketch follows below.
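The sketch below illustrates the new flag by querying an OpenAI-compatible server started with pipeline parallelism. The launch command mirrors the flag named above; the model name, port, and 2x2 GPU split are illustrative assumptions, and the `openai` Python client is assumed to be installed.

```python
# Assumed launch (4 GPUs, 2-way tensor parallel x 2-way pipeline parallel;
# the model name is illustrative):
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct \
#       --tensor-parallel-size 2 \
#       --pipeline-parallel-size 2
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Pipeline parallelism splits a model across GPUs by",
    max_tokens=32,
)
print(completion.choices[0].text)
```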
Model Support
- Support Gemma 2 (#5908, #6051). Please note that for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap. The wheels for FlashInfer can be downloaded here. A backend-selection sketch follows this list.
- Support Jamba (#4115). This is vLLM's first state space model!
- Support Deepseek-V2 (#4650). Please note that MLA (Multi-head Latent Attention) is not implemented and we are looking for contributions!
- Vision Language Models: added support for Phi3-Vision, dynamic image size, and a registry for processing model inputs (#4986, #5276, #5214)
- Notably, this includes a breaking change: all VLM-specific arguments are now removed from the engine APIs, so you no longer need to set them globally via the CLI. Instead, you only pass `<image>` into the prompt rather than use complicated prompt formatting. See more here. A request sketch follows this list.
- There is also a new guide on adding VLMs! We would love your contribution for new models!
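For the Gemma 2 note above, here is a minimal sketch of forcing the FlashInfer attention backend for offline inference. It assumes the FlashInfer wheels are already installed and uses an illustrative model name; `VLLM_ATTENTION_BACKEND` is vLLM's usual backend-selection environment variable.

```python
# Sketch: select the FlashInfer backend (which supports logits soft cap)
# before constructing the engine. Assumes FlashInfer is installed; the
# model name is illustrative.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")
outputs = llm.generate(
    ["Briefly explain what a logits soft cap does."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```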
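For the VLM changes above, here is a sketch of sending an image through the OpenAI-compatible server (see #5832 and #6091 in the change list below). It assumes a server that is already serving a supported VLM; the model name, port, and image URL are placeholders.

```python
# Sketch: chat completion with an image against a vLLM OpenAI-compatible
# server that is already serving a supported vision-language model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # placeholder served model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some-image.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```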
Hardware Support
- Enhancement to TPU support (#5292, #5878, #5850, #5831, #5855)
- OpenVINO backend (#5379)
Production Service
- Support for sharded tensorized models (#4990)
- Continuous streaming of OpenAI response token stats (#5742); a usage sketch follows below
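The sketch below reads the streamed usage stats tracked in #5742. It assumes a running server and uses an illustrative model name; `stream_options` is passed via `extra_body` so any openai 1.x client works (newer clients also accept it as a direct keyword argument).

```python
# Sketch: stream a completion and print any usage stats carried in the chunks.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder served model
    prompt="Token streaming in vLLM works by",
    max_tokens=32,
    stream=True,
    extra_body={"stream_options": {"include_usage": True}},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="", flush=True)
    if chunk.usage is not None:
        print(f"\n[usage] prompt={chunk.usage.prompt_tokens} "
              f"completion={chunk.usage.completion_tokens}")
```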
Performance
- Enhancement in distributed communication via shared memory (#5399)
- Latency enhancement in block manager (#5584)
- Enhancements to `compressed-tensors` supporting Marlin, W4A16 (#5435, #5385)
- Faster FP8 quantize kernel (#5396), FP8 on Ampere (#5975)
- Option to use FlashInfer for prefill, decode, and CUDA Graph for decode (#4628)
- Speculative Decoding
- MLPSpeculator (#4947, #6050); a usage sketch follows this list
- Typical Acceptance Sampler (#5131, #5348)
- Draft Model Runner (#5799)
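For the MLPSpeculator item above, here is a rough sketch of enabling speculative decoding with a draft MLPSpeculator checkpoint. The model pair is an illustrative assumption, not a configuration taken from these release notes; speculative decoding in this release generally expects the v2 block manager.

```python
# Rough sketch: pair a target model with an MLPSpeculator draft head.
# Both model names are illustrative; substitute checkpoints you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",             # target model (illustrative)
    speculative_model="ibm-fms/llama-13b-accelerator",  # MLPSpeculator head (illustrative)
    use_v2_block_manager=True,                          # spec decode expects block manager v2
)
outputs = llm.generate(
    ["Speculative decoding speeds up generation by"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```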
Development Productivity
- Post-merge benchmarks are now available at perf.vllm.ai!
- Addition of A100 in CI environment (#5658)
- Step towards nightly wheel publication (#5610)
What's Changed
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by @mgoin in https://github.com/vllm-project/vllm/pull/5253
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by @simon-mo in https://github.com/vllm-project/vllm/pull/5463
- [CI] Upgrade codespell version. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5381
- [Hardware] Initial TPU integration by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5292
- [Bugfix] Add device assertion to TorchSDPA by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/5402
- [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by @khluu in https://github.com/vllm-project/vllm/pull/5464
- [Kernel] Vectorized FP8 quantize kernel by @comaniac in https://github.com/vllm-project/vllm/pull/5396
- [Bugfix] TYPE_CHECKING for MultiModalData by @kimdwkimdw in https://github.com/vllm-project/vllm/pull/5444
- [Frontend] [Core] Support for sharded tensorized models by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/4990
- [misc] add hint for AttributeError by @youkaichao in https://github.com/vllm-project/vllm/pull/5462
- [Doc] Update debug docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5438
- [Bugfix] Fix typo in scheduler.py (requeset -> request) by @mgoin in https://github.com/vllm-project/vllm/pull/5470
- [Frontend] Add "input speed" to tqdm postfix alongside output speed by @mgoin in https://github.com/vllm-project/vllm/pull/5425
- [Bugfix] Fix wrong multi_modal_input format for CPU runner by @Isotr0py in https://github.com/vllm-project/vllm/pull/5451
- [Core][Distributed] add coordinator to reduce code duplication in tp and pp by @youkaichao in https://github.com/vllm-project/vllm/pull/5293
- [ci] Use sccache to build images by @khluu in https://github.com/vllm-project/vllm/pull/5419
- [Bugfix]if the content is started with ":"(response of ping), client should i… by @sywangyi in https://github.com/vllm-project/vllm/pull/5303
- [Kernel] `w4a16` support for `compressed-tensors` by @dsikka in https://github.com/vllm-project/vllm/pull/5385
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by @mgoin in https://github.com/vllm-project/vllm/pull/5466
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by @wenyujin333 in https://github.com/vllm-project/vllm/pull/5497
- [Hardware][Intel] Optimize CPU backend and add more performance tips by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4971
- [Docs] Add 4th meetup slides by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5509
- [Misc] Add vLLM version getter to utils by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5098
- [CI/Build] Simplify OpenAI server setup in tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5100
- [Doc] Update LLaVA docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5437
- [Kernel] Factor out epilogues from cutlass kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5391
- [MISC] Remove FP8 warning by @comaniac in https://github.com/vllm-project/vllm/pull/5472
- Seperate dev requirements into lint and test by @Yard1 in https://github.com/vllm-project/vllm/pull/5474
- Revert "[Core] Remove unnecessary copies in flash attn backend" by @Yard1 in https://github.com/vllm-project/vllm/pull/5478
- [misc] fix format.sh by @youkaichao in https://github.com/vllm-project/vllm/pull/5511
- [CI/Build] Disable test_fp8.py by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5508
- [Kernel] Disable CUTLASS kernels for fp8 by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5505
- Add `cuda_device_count_stateless` by @Yard1 in https://github.com/vllm-project/vllm/pull/5473
- [Hardware][Intel] Support CPU inference with AVX2 ISA by @DamonFool in https://github.com/vllm-project/vllm/pull/5452
- [Bugfix]typofix by @AllenDou in https://github.com/vllm-project/vllm/pull/5507
- bump version to v0.5.0.post1 by @simon-mo in https://github.com/vllm-project/vllm/pull/5522
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5516
- [Misc] Fix arg names by @AllenDou in https://github.com/vllm-project/vllm/pull/5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in https://github.com/vllm-project/vllm/pull/5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in https://github.com/vllm-project/vllm/pull/5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in https://github.com/vllm-project/vllm/pull/5548
- [Doc] Update documentation on Tensorizer by @sangstar in https://github.com/vllm-project/vllm/pull/5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in https://github.com/vllm-project/vllm/pull/5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in https://github.com/vllm-project/vllm/pull/5528
- Add ccache to amd by @simon-mo in https://github.com/vllm-project/vllm/pull/5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in https://github.com/vllm-project/vllm/pull/5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5538
- add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in https://github.com/vllm-project/vllm/pull/5570
- [Fix] Correct OpenAI batch response format by @zifeitong in https://github.com/vllm-project/vllm/pull/5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in https://github.com/vllm-project/vllm/pull/5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in https://github.com/vllm-project/vllm/pull/5577
- [build][misc] limit numpy version by @youkaichao in https://github.com/vllm-project/vllm/pull/5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in https://github.com/vllm-project/vllm/pull/5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in https://github.com/vllm-project/vllm/pull/5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in https://github.com/vllm-project/vllm/pull/5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in https://github.com/vllm-project/vllm/pull/5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in https://github.com/vllm-project/vllm/pull/5435
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in https://github.com/vllm-project/vllm/pull/5588
- [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in https://github.com/vllm-project/vllm/pull/3814
- [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in https://github.com/vllm-project/vllm/pull/5574
- [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5571
- [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in https://github.com/vllm-project/vllm/pull/5604
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in https://github.com/vllm-project/vllm/pull/5584
- [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in https://github.com/vllm-project/vllm/pull/5142
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5606
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in https://github.com/vllm-project/vllm/pull/5131
- [Model] Initialize Phi-3-vision support by @Isotr0py in https://github.com/vllm-project/vllm/pull/4986
- [Kernel] Add punica dimensions for Granite 13b by @joerunde in https://github.com/vllm-project/vllm/pull/5559
- [misc][typo] fix typo by @youkaichao in https://github.com/vllm-project/vllm/pull/5620
- [Misc] Fix typo by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5618
- [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5615
- [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in https://github.com/vllm-project/vllm/pull/5612
- [Misc] Remove import from transformers logging by @CatherineSue in https://github.com/vllm-project/vllm/pull/5625
- [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5623
- [ci] Deprecate original CI template by @khluu in https://github.com/vllm-project/vllm/pull/5624
- [Misc] Add OpenTelemetry support by @ronensc in https://github.com/vllm-project/vllm/pull/4687
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in https://github.com/vllm-project/vllm/pull/5542
- [ci] Setup Release pipeline and build release wheels with cache by @khluu in https://github.com/vllm-project/vllm/pull/5610
- [Model] LoRA support added for command-r by @sergey-tinkoff in https://github.com/vllm-project/vllm/pull/5178
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in https://github.com/vllm-project/vllm/pull/5639
- [Doc] Added cerebrium as Integration option by @milo157 in https://github.com/vllm-project/vllm/pull/5553
- [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5642
- [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5643
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in https://github.com/vllm-project/vllm/pull/5628
- [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in https://github.com/vllm-project/vllm/pull/5659
- [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in https://github.com/vllm-project/vllm/pull/5641
- [misc][distributed] use localhost for single-node by @youkaichao in https://github.com/vllm-project/vllm/pull/5619
- [Model] Add FP8 kv cache for Qwen2 by @mgoin in https://github.com/vllm-project/vllm/pull/5656
- [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in https://github.com/vllm-project/vllm/pull/5684
- [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in https://github.com/vllm-project/vllm/pull/5629
- [CI/Build] Add tqdm to dependencies by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5680
- [ci] Add A100 queue into AWS CI template by @khluu in https://github.com/vllm-project/vllm/pull/5648
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in https://github.com/vllm-project/vllm/pull/5688
- [ci][distributed] add tests for custom allreduce by @youkaichao in https://github.com/vllm-project/vllm/pull/5689
- [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in https://github.com/vllm-project/vllm/pull/5654
- [Doc] Update docker references by @rafvasq in https://github.com/vllm-project/vllm/pull/5614
- [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in https://github.com/vllm-project/vllm/pull/5650
- [ci] Limit num gpus if specified for A100 by @khluu in https://github.com/vllm-project/vllm/pull/5694
- [Misc] Improve conftest by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5681
- [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in https://github.com/vllm-project/vllm/pull/5703
- [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5514
- [Model] Port over CLIPVisionModel for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5591
- [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5275
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5715
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in https://github.com/vllm-project/vllm/pull/5718
- [distributed][misc] use fork by default for mp by @youkaichao in https://github.com/vllm-project/vllm/pull/5669
- [Model] MLPSpeculator speculative decoding support by @JRosenkranz in https://github.com/vllm-project/vllm/pull/4947
- [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/5441
- [BugFix] Fix test_phi3v.py by @CatherineSue in https://github.com/vllm-project/vllm/pull/5725
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in https://github.com/vllm-project/vllm/pull/5665
- [Core][Distributed] add shm broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5399
- [Kernel][CPU] Add Quick `gelu` to CPU by @ywang96 in https://github.com/vllm-project/vllm/pull/5717
- [Doc] Documentation on supported hardware for quantization methods by @mgoin in https://github.com/vllm-project/vllm/pull/5745
- [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in https://github.com/vllm-project/vllm/pull/5668
- [ci][test] fix ca test in main by @youkaichao in https://github.com/vllm-project/vllm/pull/5746
- [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in https://github.com/vllm-project/vllm/pull/5603
- [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in https://github.com/vllm-project/vllm/pull/5616
- [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in https://github.com/vllm-project/vllm/pull/5710
- [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5756
- [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5760
- [Docs][TPU] Add installation tip for TPU by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5761
- [core][distributed] improve shared memory broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5754
- [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5744
- [Distributed] Add send and recv helpers by @andoorve in https://github.com/vllm-project/vllm/pull/5719
- [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in https://github.com/vllm-project/vllm/pull/5772
- [doc][faq] add warning to download models for every nodes by @youkaichao in https://github.com/vllm-project/vllm/pull/5783
- [Doc] Add "Suggest edit" button to doc pages by @mgoin in https://github.com/vllm-project/vllm/pull/5789
- [Doc] Add Phi-3-medium to list of supported models by @mgoin in https://github.com/vllm-project/vllm/pull/5788
- [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in https://github.com/vllm-project/vllm/pull/5795
- [ci] Remove aws template by @khluu in https://github.com/vllm-project/vllm/pull/5757
- [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5818
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in https://github.com/vllm-project/vllm/pull/5414
- [Misc] Remove useless code in cpu_worker by @DamonFool in https://github.com/vllm-project/vllm/pull/5824
- [Core] Add fault tolerance for `RayTokenizerGroupPool` by @Yard1 in https://github.com/vllm-project/vllm/pull/5748
- [doc][distributed] add both gloo and nccl tests by @youkaichao in https://github.com/vllm-project/vllm/pull/5834
- [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in https://github.com/vllm-project/vllm/pull/5798
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by @dsikka in https://github.com/vllm-project/vllm/pull/5794
- [Hardware][TPU] Refactor TPU backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5831
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in https://github.com/vllm-project/vllm/pull/5422
- [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5850
- [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in https://github.com/vllm-project/vllm/pull/5791
- [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in https://github.com/vllm-project/vllm/pull/5841
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in https://github.com/vllm-project/vllm/pull/5408
- [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in https://github.com/vllm-project/vllm/pull/5832
- [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in https://github.com/vllm-project/vllm/pull/5801
- [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5829
- [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5860
- [CI/Build] Refactor image test assets by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5821
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/5560
- [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in https://github.com/vllm-project/vllm/pull/5054
- [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5855
- [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5869
- Support CPU inference with VSX PowerPC ISA by @ChipKerchner in https://github.com/vllm-project/vllm/pull/5652
- [doc] update usage of env var to avoid conflict by @youkaichao in https://github.com/vllm-project/vllm/pull/5873
- [Misc] Add example for LLaVA-NeXT by @ywang96 in https://github.com/vllm-project/vllm/pull/5879
- [BugFix] Fix cuda graph for MLPSpeculator by @njhill in https://github.com/vllm-project/vllm/pull/5875
- [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5887
- [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5880
- [Model] Add base class for LoRA-supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5018
- [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in https://github.com/vllm-project/vllm/pull/5888
- [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5526
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5896
- [doc][misc] add note for Kubernetes users by @youkaichao in https://github.com/vllm-project/vllm/pull/5916
- [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by @njhill in https://github.com/vllm-project/vllm/pull/5876
- [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by @njhill in https://github.com/vllm-project/vllm/pull/5849
- [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by @ywang96 in https://github.com/vllm-project/vllm/pull/5922
- [Model] Add Gemma 2 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5908
- [core][misc] remove logical block by @youkaichao in https://github.com/vllm-project/vllm/pull/5882
- [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in https://github.com/vllm-project/vllm/pull/5932
- [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5878
- [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5905
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in https://github.com/vllm-project/vllm/pull/5956
- [Core] Registry for processing model inputs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5214
- Unmark fused_moe config json file as executable by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5960
- [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/5379
- [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by @tdoublep in https://github.com/vllm-project/vllm/pull/5894
- [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5904
- [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5927
- [Spec Decode] Introduce DraftModelRunner by @comaniac in https://github.com/vllm-project/vllm/pull/5799
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5931
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5928
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5921
- Support Deepseek-V2 by @zwd003 in https://github.com/vllm-project/vllm/pull/4650
- [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled by @mgoin in https://github.com/vllm-project/vllm/pull/5936
- Unmark more files as executable by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5962
- [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5963
- [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4628
- [Bugfix][TPU] Fix TPU sampler output by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5978
- [Bugfix][TPU] Fix pad slot id by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5977
- [Bugfix] fix missing last itl in openai completions benchmark by @mcalman in https://github.com/vllm-project/vllm/pull/5926
- [Misc] Extend vLLM Metrics logging API by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/5925
- [Kernel] Add punica dimensions for Granite 3b and 8b by @joerunde in https://github.com/vllm-project/vllm/pull/5930
- [Bugfix] Fix precisions in Gemma 1 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5913
- [Misc] Update Phi-3-Vision Example by @ywang96 in https://github.com/vllm-project/vllm/pull/5981
- [Bugfix] Support `eos_token_id` from `config.json` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5954
- [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum by @Yard1 in https://github.com/vllm-project/vllm/pull/5974
- [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k by @comaniac in https://github.com/vllm-project/vllm/pull/5939
- [ CI/Build ] Added E2E Test For Compressed Tensors by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5839
- [CI/Build] Add TP test for vision models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5892
- [ CI/Build ] LM Eval Harness Based CI Testing by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5838
- [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests by @mawong-amd in https://github.com/vllm-project/vllm/pull/5949
- [CI/Build] Temporarily Remove Phi3-Vision from TP Test by @ywang96 in https://github.com/vllm-project/vllm/pull/5989
- [CI/Build] Reuse code for checking output consistency by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5988
- [CI/Build] [3/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5966
- [ci][distributed] fix some cuda init that makes it necessary to use spawn by @youkaichao in https://github.com/vllm-project/vllm/pull/5991
- [Frontend]: Support base64 embedding by @llmpros in https://github.com/vllm-project/vllm/pull/5935
- [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5909
- [ CI ] Temporarily Disable Large LM-Eval Tests by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6005
- [Misc] Fix `get_min_capability` by @dsikka in https://github.com/vllm-project/vllm/pull/5971
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5940
- [misc][cuda] use nvml query to avoid accidentally cuda initialization by @youkaichao in https://github.com/vllm-project/vllm/pull/6007
- [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker by @sroy745 in https://github.com/vllm-project/vllm/pull/5348
- [ CI ] Re-enable Large Model LM Eval by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6031
- [doc][misc] remove deprecated api server in doc by @youkaichao in https://github.com/vllm-project/vllm/pull/6037
- [Misc] update benchmark backend for scalellm by @zhyncs in https://github.com/vllm-project/vllm/pull/6018
- [doc][misc] further lower visibility of simple api server by @youkaichao in https://github.com/vllm-project/vllm/pull/6041
- [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool by @Yard1 in https://github.com/vllm-project/vllm/pull/6039
- [Bugfix] adding chunking mechanism to fused_moe to handle large inputs by @avshalomman in https://github.com/vllm-project/vllm/pull/6029
- add FAQ doc under 'serving' by @llmpros in https://github.com/vllm-project/vllm/pull/5946
- [Bugfix][Doc] Fix Doc Formatting by @ywang96 in https://github.com/vllm-project/vllm/pull/6048
- [Bugfix] Add explicit `end_forward` calls to flashinfer by @Yard1 in https://github.com/vllm-project/vllm/pull/6044
- [BugFix] Ensure worker model loop is always stopped at the right time by @njhill in https://github.com/vllm-project/vllm/pull/5987
- [Frontend] Relax api url assertion for openai benchmarking by @jamestwhedbee in https://github.com/vllm-project/vllm/pull/6046
- [Model] Changes to MLPSpeculator to support tie_weights and input_scale by @tdoublep in https://github.com/vllm-project/vllm/pull/5965
- [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5602
- [Frontend] Add template related params to request by @danieljannai21 in https://github.com/vllm-project/vllm/pull/5709
- [VLM] Remove `image_input_type` from VLM config by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5852
- [Doc] Reinstate doc dependencies by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6061
- [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) by @sirejdua in https://github.com/vllm-project/vllm/pull/6050
- [Core] Pipeline Parallel Support by @andoorve in https://github.com/vllm-project/vllm/pull/4412
- Update conftest.py by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6076
- [ Misc ] Refactor MoE to isolate Fp8 From Mixtral by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5970
- [CORE] Quantized lm-head Framework by @Qubitium in https://github.com/vllm-project/vllm/pull/4442
- [Model] Jamba support by @mzusman in https://github.com/vllm-project/vllm/pull/4115
- [hardware][misc] introduce platform abstraction by @youkaichao in https://github.com/vllm-project/vllm/pull/6080
- [Core] Dynamic image size support for VLMs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5276
- [CI] Fix base url doesn't strip "/" by @rkooo567 in https://github.com/vllm-project/vllm/pull/6087
- [BugFix] Avoid unnecessary Ray import warnings by @njhill in https://github.com/vllm-project/vllm/pull/6079
- [misc][distributed] error on invalid state by @youkaichao in https://github.com/vllm-project/vllm/pull/6092
- [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API by @ywang96 in https://github.com/vllm-project/vllm/pull/6091
- [Doc] Fix Mock Import by @ywang96 in https://github.com/vllm-project/vllm/pull/6094
- [Bugfix] Fix `compute_logits` in Jamba by @ywang96 in https://github.com/vllm-project/vllm/pull/6093
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin by @mgoin in https://github.com/vllm-project/vllm/pull/5975
- [core][distributed] allow custom allreduce when pipeline parallel size > 1 by @youkaichao in https://github.com/vllm-project/vllm/pull/6117
- [vlm] Remove vision language config. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/6089
- [ Misc ] Clean Up `CompressedTensorsW8A8` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6113
- [doc][misc] bump up py version in installation doc by @youkaichao in https://github.com/vllm-project/vllm/pull/6119
- [core][distributed] support layer size undividable by pp size in pipeline parallel inference by @youkaichao in https://github.com/vllm-project/vllm/pull/6115
- [Bugfix] set OMP_NUM_THREADS to 1 by default when using the multiproc_gpu_executor by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/6109
- [Distributed][Core] Support Py39 and Py38 for PP by @andoorve in https://github.com/vllm-project/vllm/pull/6120
- [CI/Build] Cleanup VLM tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6107
- [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention by @gshtras in https://github.com/vllm-project/vllm/pull/6043
- [misc][doc] try to add warning for latest html by @youkaichao in https://github.com/vllm-project/vllm/pull/5979
- [Hardware][Intel CPU] Adding intel openmp tunings in Docker file by @zhouyuan in https://github.com/vllm-project/vllm/pull/6008
- [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6051
- [VLM] Calculate maximum number of multi-modal tokens by model by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6121
- [VLM] Improve consistency between feature size calculation and dummy data for profiling by @ywang96 in https://github.com/vllm-project/vllm/pull/6146
- [VLM] Cleanup validation and update docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6149
- [Bugfix] Use templated datasource in grafana.json to allow automatic imports by @frittentheke in https://github.com/vllm-project/vllm/pull/6136
- [Frontend] Continuous usage stats in OpenAI completion API by @jvlunteren in https://github.com/vllm-project/vllm/pull/5742
- [Bugfix] Add verbose error if scipy is missing for blocksparse attention by @JGSweets in https://github.com/vllm-project/vllm/pull/5695
- bump version to v0.5.1 by @simon-mo in https://github.com/vllm-project/vllm/pull/6157
- [Docs] Fix readthedocs for tag build by @simon-mo in https://github.com/vllm-project/vllm/pull/6158
New Contributors
- @kimdwkimdw made their first contribution in https://github.com/vllm-project/vllm/pull/5444
- @sywangyi made their first contribution in https://github.com/vllm-project/vllm/pull/5303
- @garg-amit made their first contribution in https://github.com/vllm-project/vllm/pull/5595
- @CharlesRiggins made their first contribution in https://github.com/vllm-project/vllm/pull/5592
- @zhyncs made their first contribution in https://github.com/vllm-project/vllm/pull/5588
- @bfontain made their first contribution in https://github.com/vllm-project/vllm/pull/5142
- @sroy745 made their first contribution in https://github.com/vllm-project/vllm/pull/5131
- @joerunde made their first contribution in https://github.com/vllm-project/vllm/pull/5559
- @sergey-tinkoff made their first contribution in https://github.com/vllm-project/vllm/pull/5178
- @milo157 made their first contribution in https://github.com/vllm-project/vllm/pull/5553
- @ShukantPal made their first contribution in https://github.com/vllm-project/vllm/pull/5628
- @rafvasq made their first contribution in https://github.com/vllm-project/vllm/pull/5614
- @JRosenkranz made their first contribution in https://github.com/vllm-project/vllm/pull/4947
- @rohithkrn made their first contribution in https://github.com/vllm-project/vllm/pull/5603
- @wooyeonlee0 made their first contribution in https://github.com/vllm-project/vllm/pull/5414
- @aws-patlange made their first contribution in https://github.com/vllm-project/vllm/pull/5841
- @stephanie-wang made their first contribution in https://github.com/vllm-project/vllm/pull/5408
- @ProExpertProg made their first contribution in https://github.com/vllm-project/vllm/pull/5560
- @ChipKerchner made their first contribution in https://github.com/vllm-project/vllm/pull/5652
- @ilya-lavrenov made their first contribution in https://github.com/vllm-project/vllm/pull/5379
- @mcalman made their first contribution in https://github.com/vllm-project/vllm/pull/5926
- @SolitaryThinker made their first contribution in https://github.com/vllm-project/vllm/pull/5925
- @llmpros made their first contribution in https://github.com/vllm-project/vllm/pull/5935
- @avshalomman made their first contribution in https://github.com/vllm-project/vllm/pull/6029
- @danieljannai21 made their first contribution in https://github.com/vllm-project/vllm/pull/5709
- @sirejdua made their first contribution in https://github.com/vllm-project/vllm/pull/6050
- @gshtras made their first contribution in https://github.com/vllm-project/vllm/pull/6043
- @frittentheke made their first contribution in https://github.com/vllm-project/vllm/pull/6136
- @jvlunteren made their first contribution in https://github.com/vllm-project/vllm/pull/5742
- @JGSweets made their first contribution in https://github.com/vllm-project/vllm/pull/5695
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.0...v0.5.1
1. vllm-0.5.1+cu118-cp310-cp310-manylinux1_x86_64.whl (140.54 MB)
2. vllm-0.5.1+cu118-cp311-cp311-manylinux1_x86_64.whl (140.54 MB)
3. vllm-0.5.1+cu118-cp38-cp38-manylinux1_x86_64.whl (140.54 MB)
4. vllm-0.5.1+cu118-cp39-cp39-manylinux1_x86_64.whl (140.54 MB)
5. vllm-0.5.1-cp310-cp310-manylinux1_x86_64.whl (140.1 MB)
6. vllm-0.5.1-cp311-cp311-manylinux1_x86_64.whl (140.1 MB)
7. vllm-0.5.1-cp38-cp38-manylinux1_x86_64.whl (140.1 MB)
8. vllm-0.5.1-cp39-cp39-manylinux1_x86_64.whl (140.1 MB)