v0.5.1
Release date: 2024-07-06 03:47:01
Latest release of vllm-project/vllm: v0.6.1 (2024-09-12 05:44:44)
Highlights
- vLLM now has pipeline parallelism! (#4412, #5408, #6115, #6120). You can now run the API server with `--pipeline-parallel-size`. This feature is at an early stage; please let us know your feedback. A launch-and-query sketch follows below.
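The sketch below illustrates the new flag by querying an OpenAI-compatible server started with pipeline parallelism. The launch command mirrors the flag named above; the model name, port, and 2x2 GPU split are illustrative assumptions, and the `openai` Python client is assumed to be installed.

```python
# Assumed launch (4 GPUs, 2-way tensor parallel x 2-way pipeline parallel;
# the model name is illustrative):
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct \
#       --tensor-parallel-size 2 \
#       --pipeline-parallel-size 2
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Pipeline parallelism splits a model across GPUs by",
    max_tokens=32,
)
print(completion.choices[0].text)
```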
Model Support
- Support Gemma 2 (#5908, #6051). Please note that for correctness, Gemma 2 should run with the FlashInfer backend, which supports logits soft cap. The wheels for FlashInfer can be downloaded here. A backend-selection sketch follows this list.
- Support Jamba (#4115). This is vLLM's first state space model!
- Support Deepseek-V2 (#4650). Please note that MLA (Multi-head Latent Attention) is not implemented and we are looking for contributions!
- Vision Language Models: added support for Phi3-Vision, dynamic image size, and a registry for processing model inputs (#4986, #5276, #5214)
- Notably, this includes a breaking change: all VLM-specific arguments are now removed from the engine APIs, so you no longer need to set them globally via the CLI. Instead, you only pass `<image>` into the prompt rather than use complicated prompt formatting. See more here. A request sketch follows this list.
- There is also a new guide on adding VLMs! We would love your contribution for new models!
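For the Gemma 2 note above, here is a minimal sketch of forcing the FlashInfer attention backend for offline inference. It assumes the FlashInfer wheels are already installed and uses an illustrative model name; `VLLM_ATTENTION_BACKEND` is vLLM's usual backend-selection environment variable.

```python
# Sketch: select the FlashInfer backend (which supports logits soft cap)
# before constructing the engine. Assumes FlashInfer is installed; the
# model name is illustrative.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")
outputs = llm.generate(
    ["Briefly explain what a logits soft cap does."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```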
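For the VLM changes above, here is a sketch of sending an image through the OpenAI-compatible server (see #5832 and #6091 in the change list below). It assumes a server that is already serving a supported VLM; the model name, port, and image URL are placeholders.

```python
# Sketch: chat completion with an image against a vLLM OpenAI-compatible
# server that is already serving a supported vision-language model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3-vision-128k-instruct",  # placeholder served model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some-image.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```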
Hardware Support
- Enhancement to TPU support (#5292, #5878, #5850, #5831, #5855)
- OpenVINO backend (#5379)
Production Service
- Support for sharded tensorized models (#4990)
- Continuous streaming of OpenAI response token stats (#5742); a usage sketch follows below
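The sketch below reads the streamed usage stats tracked in #5742. It assumes a running server and uses an illustrative model name; `stream_options` is passed via `extra_body` so any openai 1.x client works (newer clients also accept it as a direct keyword argument).

```python
# Sketch: stream a completion and print any usage stats carried in the chunks.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder served model
    prompt="Token streaming in vLLM works by",
    max_tokens=32,
    stream=True,
    extra_body={"stream_options": {"include_usage": True}},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="", flush=True)
    if chunk.usage is not None:
        print(f"\n[usage] prompt={chunk.usage.prompt_tokens} "
              f"completion={chunk.usage.completion_tokens}")
```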
Performance
- Enhancement in distributed communication via shared memory (#5399)
- Latency enhancement in block manager (#5584)
- Enhancements to `compressed-tensors` supporting Marlin, W4A16 (#5435, #5385)
- Faster FP8 quantize kernel (#5396), FP8 on Ampere (#5975)
- Option to use FlashInfer for prefill, decode, and CUDA Graph for decode (#4628)
- Speculative Decoding
- MLPSpeculator (#4947, #6050); a usage sketch follows this list
- Typical Acceptance Sampler (#5131, #5348)
- Draft Model Runner (#5799)
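For the MLPSpeculator item above, here is a rough sketch of enabling speculative decoding with a draft MLPSpeculator checkpoint. The model pair is an illustrative assumption, not a configuration taken from these release notes; speculative decoding in this release generally expects the v2 block manager.

```python
# Rough sketch: pair a target model with an MLPSpeculator draft head.
# Both model names are illustrative; substitute checkpoints you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",             # target model (illustrative)
    speculative_model="ibm-fms/llama-13b-accelerator",  # MLPSpeculator head (illustrative)
    use_v2_block_manager=True,                          # spec decode expects block manager v2
)
outputs = llm.generate(
    ["Speculative decoding speeds up generation by"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```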
Development Productivity
- Post-merge benchmarks are now available at perf.vllm.ai!
- Addition of A100 in CI environment (#5658)
- Step towards nightly wheel publication (#5610)
What's Changed
- [CI/Build] Add `is_quant_method_supported` to control quantization test configurations by @mgoin in https://github.com/vllm-project/vllm/pull/5253
- Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" by @simon-mo in https://github.com/vllm-project/vllm/pull/5463
- [CI] Upgrade codespell version. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5381
- [Hardware] Initial TPU integration by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5292
- [Bugfix] Add device assertion to TorchSDPA by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/5402
- [ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests by @khluu in https://github.com/vllm-project/vllm/pull/5464
- [Kernel] Vectorized FP8 quantize kernel by @comaniac in https://github.com/vllm-project/vllm/pull/5396
- [Bugfix] TYPE_CHECKING for MultiModalData by @kimdwkimdw in https://github.com/vllm-project/vllm/pull/5444
- [Frontend] [Core] Support for sharded tensorized models by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/4990
- [misc] add hint for AttributeError by @youkaichao in https://github.com/vllm-project/vllm/pull/5462
- [Doc] Update debug docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5438
- [Bugfix] Fix typo in scheduler.py (requeset -> request) by @mgoin in https://github.com/vllm-project/vllm/pull/5470
- [Frontend] Add "input speed" to tqdm postfix alongside output speed by @mgoin in https://github.com/vllm-project/vllm/pull/5425
- [Bugfix] Fix wrong multi_modal_input format for CPU runner by @Isotr0py in https://github.com/vllm-project/vllm/pull/5451
- [Core][Distributed] add coordinator to reduce code duplication in tp and pp by @youkaichao in https://github.com/vllm-project/vllm/pull/5293
- [ci] Use sccache to build images by @khluu in https://github.com/vllm-project/vllm/pull/5419
- [Bugfix]if the content is started with ":"(response of ping), client should i… by @sywangyi in https://github.com/vllm-project/vllm/pull/5303
- [Kernel] `w4a16` support for `compressed-tensors` by @dsikka in https://github.com/vllm-project/vllm/pull/5385
- [CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations by @mgoin in https://github.com/vllm-project/vllm/pull/5466
- [Kernel] Tune Qwen2MoE kernel configurations with tp2,4 by @wenyujin333 in https://github.com/vllm-project/vllm/pull/5497
- [Hardware][Intel] Optimize CPU backend and add more performance tips by @bigPYJ1151 in https://github.com/vllm-project/vllm/pull/4971
- [Docs] Add 4th meetup slides by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5509
- [Misc] Add vLLM version getter to utils by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5098
- [CI/Build] Simplify OpenAI server setup in tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5100
- [Doc] Update LLaVA docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5437
- [Kernel] Factor out epilogues from cutlass kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5391
- [MISC] Remove FP8 warning by @comaniac in https://github.com/vllm-project/vllm/pull/5472
- Seperate dev requirements into lint and test by @Yard1 in https://github.com/vllm-project/vllm/pull/5474
- Revert "[Core] Remove unnecessary copies in flash attn backend" by @Yard1 in https://github.com/vllm-project/vllm/pull/5478
- [misc] fix format.sh by @youkaichao in https://github.com/vllm-project/vllm/pull/5511
- [CI/Build] Disable test_fp8.py by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5508
- [Kernel] Disable CUTLASS kernels for fp8 by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5505
- Add `cuda_device_count_stateless` by @Yard1 in https://github.com/vllm-project/vllm/pull/5473
- [Hardware][Intel] Support CPU inference with AVX2 ISA by @DamonFool in https://github.com/vllm-project/vllm/pull/5452
- [Bugfix]typofix by @AllenDou in https://github.com/vllm-project/vllm/pull/5507
- bump version to v0.5.0.post1 by @simon-mo in https://github.com/vllm-project/vllm/pull/5522
- [CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5073
- [CI/Build] Disable LLaVA-NeXT CPU test by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5529
- [Kernel] Fix CUTLASS 3.x custom broadcast load epilogue by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5516
- [Misc] Fix arg names by @AllenDou in https://github.com/vllm-project/vllm/pull/5524
- [ Misc ] Rs/compressed tensors cleanup by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5432
- [Kernel] Suppress mma.sp warning on CUDA 12.5 and later by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5401
- [mis] fix flaky test of test_cuda_device_count_stateless by @youkaichao in https://github.com/vllm-project/vllm/pull/5546
- [Core] Remove duplicate processing in async engine by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5525
- [misc][distributed] fix benign error in `is_in_the_same_node` by @youkaichao in https://github.com/vllm-project/vllm/pull/5512
- [Docs] Add ZhenFund as a Sponsor by @simon-mo in https://github.com/vllm-project/vllm/pull/5548
- [Doc] Update documentation on Tensorizer by @sangstar in https://github.com/vllm-project/vllm/pull/5471
- [Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models by @tdoublep in https://github.com/vllm-project/vllm/pull/5460
- [Bugfix] Fix typo in Pallas backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5558
- [Core][Distributed] improve p2p cache generation by @youkaichao in https://github.com/vllm-project/vllm/pull/5528
- Add ccache to amd by @simon-mo in https://github.com/vllm-project/vllm/pull/5555
- [Core][Bugfix]: fix prefix caching for blockv2 by @leiwen83 in https://github.com/vllm-project/vllm/pull/5364
- [mypy] Enable type checking for test directory by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5017
- [CI/Build] Test both text and token IDs in batched OpenAI Completions API by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5568
- [misc] Do not allow to use lora with chunked prefill. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5538
- add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5145
- [BugFix] Don't start a Ray cluster when not using Ray by @njhill in https://github.com/vllm-project/vllm/pull/5570
- [Fix] Correct OpenAI batch response format by @zifeitong in https://github.com/vllm-project/vllm/pull/5554
- Add basic correctness 2 GPU tests to 4 GPU pipeline by @Yard1 in https://github.com/vllm-project/vllm/pull/5518
- [CI][BugFix] Flip is_quant_method_supported condition by @mgoin in https://github.com/vllm-project/vllm/pull/5577
- [build][misc] limit numpy version by @youkaichao in https://github.com/vllm-project/vllm/pull/5582
- [Doc] add debugging tips for crash and multi-node debugging by @youkaichao in https://github.com/vllm-project/vllm/pull/5581
- Fix w8a8 benchmark and add Llama-3-8B by @comaniac in https://github.com/vllm-project/vllm/pull/5562
- [Model] Rename Phi3 rope scaling type by @garg-amit in https://github.com/vllm-project/vllm/pull/5595
- Correct alignment in the seq_len diagram. by @CharlesRiggins in https://github.com/vllm-project/vllm/pull/5592
- [Kernel] `compressed-tensors` marlin 24 support by @dsikka in https://github.com/vllm-project/vllm/pull/5435
- [Misc] use AutoTokenizer for benchmark serving when vLLM not installed by @zhyncs in https://github.com/vllm-project/vllm/pull/5588
- [Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend by @jikunshang in https://github.com/vllm-project/vllm/pull/3814
- [CI/BUILD] Support non-AVX512 vLLM building and testing by @DamonFool in https://github.com/vllm-project/vllm/pull/5574
- [CI] Improve the readability of performance benchmarking results and prepare for upcoming performance dashboard by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5571
- [bugfix][distributed] fix 16 gpus local rank arrangement by @youkaichao in https://github.com/vllm-project/vllm/pull/5604
- [Optimization] use a pool to reuse LogicalTokenBlock.token_ids by @youkaichao in https://github.com/vllm-project/vllm/pull/5584
- [Bugfix] Fix KV head calculation for MPT models when using GQA by @bfontain in https://github.com/vllm-project/vllm/pull/5142
- [Fix] Use utf-8 encoding in entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5606
- [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier by @sroy745 in https://github.com/vllm-project/vllm/pull/5131
- [Model] Initialize Phi-3-vision support by @Isotr0py in https://github.com/vllm-project/vllm/pull/4986
- [Kernel] Add punica dimensions for Granite 13b by @joerunde in https://github.com/vllm-project/vllm/pull/5559
- [misc][typo] fix typo by @youkaichao in https://github.com/vllm-project/vllm/pull/5620
- [Misc] Fix typo by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5618
- [CI] Avoid naming different metrics with the same name in performance benchmark by @KuntaiDu in https://github.com/vllm-project/vllm/pull/5615
- [bugfix][distributed] do not error if two processes do not agree on p2p capability by @youkaichao in https://github.com/vllm-project/vllm/pull/5612
- [Misc] Remove import from transformers logging by @CatherineSue in https://github.com/vllm-project/vllm/pull/5625
- [CI/Build][Misc] Update Pytest Marker for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5623
- [ci] Deprecate original CI template by @khluu in https://github.com/vllm-project/vllm/pull/5624
- [Misc] Add OpenTelemetry support by @ronensc in https://github.com/vllm-project/vllm/pull/4687
- [Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization by @dsikka in https://github.com/vllm-project/vllm/pull/5542
- [ci] Setup Release pipeline and build release wheels with cache by @khluu in https://github.com/vllm-project/vllm/pull/5610
- [Model] LoRA support added for command-r by @sergey-tinkoff in https://github.com/vllm-project/vllm/pull/5178
- [Bugfix] Fix for inconsistent behaviour related to sampling and repetition penalties by @tdoublep in https://github.com/vllm-project/vllm/pull/5639
- [Doc] Added cerebrium as Integration option by @milo157 in https://github.com/vllm-project/vllm/pull/5553
- [Bugfix] Fix CUDA version check for mma warning suppression by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5642
- [Bugfix] Fix w8a8 benchmarks for int8 case by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5643
- [Bugfix] Fix Phi-3 Long RoPE scaling implementation by @ShukantPal in https://github.com/vllm-project/vllm/pull/5628
- [Bugfix] Added test for sampling repetition penalty bug. by @tdoublep in https://github.com/vllm-project/vllm/pull/5659
- [Bugfix][CI/Build][AMD][ROCm]Fixed the cmake build bug which generate garbage on certain devices by @hongxiayang in https://github.com/vllm-project/vllm/pull/5641
- [misc][distributed] use localhost for single-node by @youkaichao in https://github.com/vllm-project/vllm/pull/5619
- [Model] Add FP8 kv cache for Qwen2 by @mgoin in https://github.com/vllm-project/vllm/pull/5656
- [Bugfix] Fix sampling_params passed incorrectly in Phi3v example by @Isotr0py in https://github.com/vllm-project/vllm/pull/5684
- [Misc]Add param max-model-len in benchmark_latency.py by @DearPlanet in https://github.com/vllm-project/vllm/pull/5629
- [CI/Build] Add tqdm to dependencies by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5680
- [ci] Add A100 queue into AWS CI template by @khluu in https://github.com/vllm-project/vllm/pull/5648
- [Frontend][Bugfix] Fix preemption_mode -> preemption-mode for CLI arg in arg_utils.py by @mgoin in https://github.com/vllm-project/vllm/pull/5688
- [ci][distributed] add tests for custom allreduce by @youkaichao in https://github.com/vllm-project/vllm/pull/5689
- [Bugfix] AsyncLLMEngine hangs with asyncio.run by @zifeitong in https://github.com/vllm-project/vllm/pull/5654
- [Doc] Update docker references by @rafvasq in https://github.com/vllm-project/vllm/pull/5614
- [Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes by @dsikka in https://github.com/vllm-project/vllm/pull/5650
- [ci] Limit num gpus if specified for A100 by @khluu in https://github.com/vllm-project/vllm/pull/5694
- [Misc] Improve conftest by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5681
- [Bugfix][Doc] FIx Duplicate Explicit Target Name Errors by @ywang96 in https://github.com/vllm-project/vllm/pull/5703
- [Kernel] Update Cutlass int8 kernel configs for SM90 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5514
- [Model] Port over CLIPVisionModel for VLMs by @ywang96 in https://github.com/vllm-project/vllm/pull/5591
- [Kernel] Update Cutlass int8 kernel configs for SM80 by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5275
- [Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5715
- [Frontend] Add FlexibleArgumentParser to support both underscore and dash in names by @mgoin in https://github.com/vllm-project/vllm/pull/5718
- [distributed][misc] use fork by default for mp by @youkaichao in https://github.com/vllm-project/vllm/pull/5669
- [Model] MLPSpeculator speculative decoding support by @JRosenkranz in https://github.com/vllm-project/vllm/pull/4947
- [Kernel] Add punica dimension for Qwen2 LoRA by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/5441
- [BugFix] Fix test_phi3v.py by @CatherineSue in https://github.com/vllm-project/vllm/pull/5725
- [Bugfix] Add fully sharded layer for QKVParallelLinearWithLora by @jeejeelee in https://github.com/vllm-project/vllm/pull/5665
- [Core][Distributed] add shm broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5399
- [Kernel][CPU] Add Quick `gelu` to CPU by @ywang96 in https://github.com/vllm-project/vllm/pull/5717
- [Doc] Documentation on supported hardware for quantization methods by @mgoin in https://github.com/vllm-project/vllm/pull/5745
- [BugFix] exclude version 1.15.0 for modelscope by @zhyncs in https://github.com/vllm-project/vllm/pull/5668
- [ci][test] fix ca test in main by @youkaichao in https://github.com/vllm-project/vllm/pull/5746
- [LoRA] Add support for pinning lora adapters in the LRU cache by @rohithkrn in https://github.com/vllm-project/vllm/pull/5603
- [CI][Hardware][Intel GPU] add Intel GPU(XPU) ci pipeline by @jikunshang in https://github.com/vllm-project/vllm/pull/5616
- [Model] Support Qwen-VL and Qwen-VL-Chat models with text-only inputs by @DamonFool in https://github.com/vllm-project/vllm/pull/5710
- [Misc] Remove #4789 workaround left in vllm/entrypoints/openai/run_batch.py by @zifeitong in https://github.com/vllm-project/vllm/pull/5756
- [Bugfix] Fix pin_lora error in TPU executor by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5760
- [Docs][TPU] Add installation tip for TPU by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5761
- [core][distributed] improve shared memory broadcast by @youkaichao in https://github.com/vllm-project/vllm/pull/5754
- [BugFix] [Kernel] Add Cutlass2x fallback kernels by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/5744
- [Distributed] Add send and recv helpers by @andoorve in https://github.com/vllm-project/vllm/pull/5719
- [Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement by @Isotr0py in https://github.com/vllm-project/vllm/pull/5772
- [doc][faq] add warning to download models for every nodes by @youkaichao in https://github.com/vllm-project/vllm/pull/5783
- [Doc] Add "Suggest edit" button to doc pages by @mgoin in https://github.com/vllm-project/vllm/pull/5789
- [Doc] Add Phi-3-medium to list of supported models by @mgoin in https://github.com/vllm-project/vllm/pull/5788
- [Bugfix] Fix FlexibleArgumentParser replaces _ with - for actual args by @CatherineSue in https://github.com/vllm-project/vllm/pull/5795
- [ci] Remove aws template by @khluu in https://github.com/vllm-project/vllm/pull/5757
- [Doc] Add notice about breaking changes to VLMs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5818
- [Speculative Decoding] Support draft model on different tensor-parallel size than target model by @wooyeonlee0 in https://github.com/vllm-project/vllm/pull/5414
- [Misc] Remove useless code in cpu_worker by @DamonFool in https://github.com/vllm-project/vllm/pull/5824
- [Core] Add fault tolerance for `RayTokenizerGroupPool` by @Yard1 in https://github.com/vllm-project/vllm/pull/5748
- [doc][distributed] add both gloo and nccl tests by @youkaichao in https://github.com/vllm-project/vllm/pull/5834
- [CI/Build] Add unit testing for FlexibleArgumentParser by @mgoin in https://github.com/vllm-project/vllm/pull/5798
- [Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` by @dsikka in https://github.com/vllm-project/vllm/pull/5794
- [Hardware][TPU] Refactor TPU backend by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5831
- [Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes by @mawong-amd in https://github.com/vllm-project/vllm/pull/5422
- [Hardware][TPU] Raise errors for unsupported sampling params by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5850
- [CI/Build] Add E2E tests for MLPSpeculator by @tdoublep in https://github.com/vllm-project/vllm/pull/5791
- [Bugfix] Fix assertion in NeuronExecutor by @aws-patlange in https://github.com/vllm-project/vllm/pull/5841
- [Core] Refactor Worker and ModelRunner to consolidate control plane communication by @stephanie-wang in https://github.com/vllm-project/vllm/pull/5408
- [Misc][Doc] Add Example of using OpenAI Server with VLM by @ywang96 in https://github.com/vllm-project/vllm/pull/5832
- [bugfix][distributed] fix shm broadcast when the queue size is full by @youkaichao in https://github.com/vllm-project/vllm/pull/5801
- [Bugfix] Fix embedding to support 2D inputs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5829
- [Bugfix][TPU] Fix KV cache size calculation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5860
- [CI/Build] Refactor image test assets by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5821
- [Kernel] Adding bias epilogue support for `cutlass_scaled_mm` by @ProExpertProg in https://github.com/vllm-project/vllm/pull/5560
- [Frontend] Add tokenize/detokenize endpoints by @sasha0552 in https://github.com/vllm-project/vllm/pull/5054
- [Hardware][TPU] Support parallel sampling & Swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5855
- [Bugfix][TPU] Fix CPU cache allocation by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5869
- Support CPU inference with VSX PowerPC ISA by @ChipKerchner in https://github.com/vllm-project/vllm/pull/5652
- [doc] update usage of env var to avoid conflict by @youkaichao in https://github.com/vllm-project/vllm/pull/5873
- [Misc] Add example for LLaVA-NeXT by @ywang96 in https://github.com/vllm-project/vllm/pull/5879
- [BugFix] Fix cuda graph for MLPSpeculator by @njhill in https://github.com/vllm-project/vllm/pull/5875
- [Doc] Add note about context length in Phi-3-Vision example by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5887
- [VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5880
- [Model] Add base class for LoRA-supported models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5018
- [Bugfix] Fix img_sizes Parsing in Phi3-Vision by @ywang96 in https://github.com/vllm-project/vllm/pull/5888
- [CI/Build] [1/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5526
- [Model][Bugfix] Implicit model flags and reenable Phi-3-Vision by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5896
- [doc][misc] add note for Kubernetes users by @youkaichao in https://github.com/vllm-project/vllm/pull/5916
- [BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` by @njhill in https://github.com/vllm-project/vllm/pull/5876
- [BugFix] Fix `min_tokens` behaviour for multiple eos tokens by @njhill in https://github.com/vllm-project/vllm/pull/5849
- [CI/Build] Fix Args for `_get_logits_warper` in Sampler Test by @ywang96 in https://github.com/vllm-project/vllm/pull/5922
- [Model] Add Gemma 2 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5908
- [core][misc] remove logical block by @youkaichao in https://github.com/vllm-project/vllm/pull/5882
- [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X by @divakar-amd in https://github.com/vllm-project/vllm/pull/5932
- [Hardware][TPU] Optimize KV cache swapping by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5878
- [VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5905
- [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner by @Isotr0py in https://github.com/vllm-project/vllm/pull/5956
- [Core] Registry for processing model inputs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5214
- Unmark fused_moe config json file as executable by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5960
- [Hardware][Intel] OpenVINO vLLM backend by @ilya-lavrenov in https://github.com/vllm-project/vllm/pull/5379
- [Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high by @tdoublep in https://github.com/vllm-project/vllm/pull/5894
- [CI/Build] [2/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5904
- [Distributed] Make it clear that % should not be in tensor dict keys. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5927
- [Spec Decode] Introduce DraftModelRunner by @comaniac in https://github.com/vllm-project/vllm/pull/5799
- [Bugfix] Fix compute datatype for cutlass 3.x epilogues by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5931
- [ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5928
- [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5921
- Support Deepseek-V2 by @zwd003 in https://github.com/vllm-project/vllm/pull/4650
- [Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled by @mgoin in https://github.com/vllm-project/vllm/pull/5936
- Unmark more files as executable by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/5962
- [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5963
- [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/4628
- [Bugfix][TPU] Fix TPU sampler output by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5978
- [Bugfix][TPU] Fix pad slot id by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5977
- [Bugfix] fix missing last itl in openai completions benchmark by @mcalman in https://github.com/vllm-project/vllm/pull/5926
- [Misc] Extend vLLM Metrics logging API by @SolitaryThinker in https://github.com/vllm-project/vllm/pull/5925
- [Kernel] Add punica dimensions for Granite 3b and 8b by @joerunde in https://github.com/vllm-project/vllm/pull/5930
- [Bugfix] Fix precisions in Gemma 1 by @WoosukKwon in https://github.com/vllm-project/vllm/pull/5913
- [Misc] Update Phi-3-Vision Example by @ywang96 in https://github.com/vllm-project/vllm/pull/5981
- [Bugfix] Support `eos_token_id` from `config.json` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5954
- [Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum by @Yard1 in https://github.com/vllm-project/vllm/pull/5974
- [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k by @comaniac in https://github.com/vllm-project/vllm/pull/5939
- [ CI/Build ] Added E2E Test For Compressed Tensors by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5839
- [CI/Build] Add TP test for vision models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5892
- [ CI/Build ] LM Eval Harness Based CI Testing by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5838
- [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests by @mawong-amd in https://github.com/vllm-project/vllm/pull/5949
- [CI/Build] Temporarily Remove Phi3-Vision from TP Test by @ywang96 in https://github.com/vllm-project/vllm/pull/5989
- [CI/Build] Reuse code for checking output consistency by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5988
- [CI/Build] [3/3] Reorganize entrypoints tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5966
- [ci][distributed] fix some cuda init that makes it necessary to use spawn by @youkaichao in https://github.com/vllm-project/vllm/pull/5991
- [Frontend]: Support base64 embedding by @llmpros in https://github.com/vllm-project/vllm/pull/5935
- [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. by @rkooo567 in https://github.com/vllm-project/vllm/pull/5909
- [ CI ] Temporarily Disable Large LM-Eval Tests by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6005
- [Misc] Fix `get_min_capability` by @dsikka in https://github.com/vllm-project/vllm/pull/5971
- [ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5940
- [misc][cuda] use nvml query to avoid accidentally cuda initialization by @youkaichao in https://github.com/vllm-project/vllm/pull/6007
- [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker by @sroy745 in https://github.com/vllm-project/vllm/pull/5348
- [ CI ] Re-enable Large Model LM Eval by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6031
- [doc][misc] remove deprecated api server in doc by @youkaichao in https://github.com/vllm-project/vllm/pull/6037
- [Misc] update benchmark backend for scalellm by @zhyncs in https://github.com/vllm-project/vllm/pull/6018
- [doc][misc] further lower visibility of simple api server by @youkaichao in https://github.com/vllm-project/vllm/pull/6041
- [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool by @Yard1 in https://github.com/vllm-project/vllm/pull/6039
- [Bugfix] adding chunking mechanism to fused_moe to handle large inputs by @avshalomman in https://github.com/vllm-project/vllm/pull/6029
- add FAQ doc under 'serving' by @llmpros in https://github.com/vllm-project/vllm/pull/5946
- [Bugfix][Doc] Fix Doc Formatting by @ywang96 in https://github.com/vllm-project/vllm/pull/6048
- [Bugfix] Add explicit `end_forward` calls to flashinfer by @Yard1 in https://github.com/vllm-project/vllm/pull/6044
- [BugFix] Ensure worker model loop is always stopped at the right time by @njhill in https://github.com/vllm-project/vllm/pull/5987
- [Frontend] Relax api url assertion for openai benchmarking by @jamestwhedbee in https://github.com/vllm-project/vllm/pull/6046
- [Model] Changes to MLPSpeculator to support tie_weights and input_scale by @tdoublep in https://github.com/vllm-project/vllm/pull/5965
- [Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) by @alexm-neuralmagic in https://github.com/vllm-project/vllm/pull/5602
- [Frontend] Add template related params to request by @danieljannai21 in https://github.com/vllm-project/vllm/pull/5709
- [VLM] Remove `image_input_type` from VLM config by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/5852
- [Doc] Reinstate doc dependencies by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6061
- [Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) by @sirejdua in https://github.com/vllm-project/vllm/pull/6050
- [Core] Pipeline Parallel Support by @andoorve in https://github.com/vllm-project/vllm/pull/4412
- Update conftest.py by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6076
- [ Misc ] Refactor MoE to isolate Fp8 From Mixtral by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/5970
- [CORE] Quantized lm-head Framework by @Qubitium in https://github.com/vllm-project/vllm/pull/4442
- [Model] Jamba support by @mzusman in https://github.com/vllm-project/vllm/pull/4115
- [hardware][misc] introduce platform abstraction by @youkaichao in https://github.com/vllm-project/vllm/pull/6080
- [Core] Dynamic image size support for VLMs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/5276
- [CI] Fix base url doesn't strip "/" by @rkooo567 in https://github.com/vllm-project/vllm/pull/6087
- [BugFix] Avoid unnecessary Ray import warnings by @njhill in https://github.com/vllm-project/vllm/pull/6079
- [misc][distributed] error on invalid state by @youkaichao in https://github.com/vllm-project/vllm/pull/6092
- [VLM][Frontend] Proper Image Prompt Formatting from OpenAI API by @ywang96 in https://github.com/vllm-project/vllm/pull/6091
- [Doc] Fix Mock Import by @ywang96 in https://github.com/vllm-project/vllm/pull/6094
- [Bugfix] Fix `compute_logits` in Jamba by @ywang96 in https://github.com/vllm-project/vllm/pull/6093
- [Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin by @mgoin in https://github.com/vllm-project/vllm/pull/5975
- [core][distributed] allow custom allreduce when pipeline parallel size > 1 by @youkaichao in https://github.com/vllm-project/vllm/pull/6117
- [vlm] Remove vision language config. by @xwjiang2010 in https://github.com/vllm-project/vllm/pull/6089
- [ Misc ] Clean Up `CompressedTensorsW8A8` by @robertgshaw2-neuralmagic in https://github.com/vllm-project/vllm/pull/6113
- [doc][misc] bump up py version in installation doc by @youkaichao in https://github.com/vllm-project/vllm/pull/6119
- [core][distributed] support layer size undividable by pp size in pipeline parallel inference by @youkaichao in https://github.com/vllm-project/vllm/pull/6115
- [Bugfix] set OMP_NUM_THREADS to 1 by default when using the multiproc_gpu_executor by @tjohnson31415 in https://github.com/vllm-project/vllm/pull/6109
- [Distributed][Core] Support Py39 and Py38 for PP by @andoorve in https://github.com/vllm-project/vllm/pull/6120
- [CI/Build] Cleanup VLM tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6107
- [ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention by @gshtras in https://github.com/vllm-project/vllm/pull/6043
- [misc][doc] try to add warning for latest html by @youkaichao in https://github.com/vllm-project/vllm/pull/5979
- [Hardware][Intel CPU] Adding intel openmp tunings in Docker file by @zhouyuan in https://github.com/vllm-project/vllm/pull/6008
- [Kernel][Model] logits_soft_cap for Gemma2 with flashinfer by @LiuXiaoxuanPKU in https://github.com/vllm-project/vllm/pull/6051
- [VLM] Calculate maximum number of multi-modal tokens by model by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6121
- [VLM] Improve consistency between feature size calculation and dummy data for profiling by @ywang96 in https://github.com/vllm-project/vllm/pull/6146
- [VLM] Cleanup validation and update docs by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/6149
- [Bugfix] Use templated datasource in grafana.json to allow automatic imports by @frittentheke in https://github.com/vllm-project/vllm/pull/6136
- [Frontend] Continuous usage stats in OpenAI completion API by @jvlunteren in https://github.com/vllm-project/vllm/pull/5742
- [Bugfix] Add verbose error if scipy is missing for blocksparse attention by @JGSweets in https://github.com/vllm-project/vllm/pull/5695
- bump version to v0.5.1 by @simon-mo in https://github.com/vllm-project/vllm/pull/6157
- [Docs] Fix readthedocs for tag build by @simon-mo in https://github.com/vllm-project/vllm/pull/6158
New Contributors
- @kimdwkimdw made their first contribution in https://github.com/vllm-project/vllm/pull/5444
- @sywangyi made their first contribution in https://github.com/vllm-project/vllm/pull/5303
- @garg-amit made their first contribution in https://github.com/vllm-project/vllm/pull/5595
- @CharlesRiggins made their first contribution in https://github.com/vllm-project/vllm/pull/5592
- @zhyncs made their first contribution in https://github.com/vllm-project/vllm/pull/5588
- @bfontain made their first contribution in https://github.com/vllm-project/vllm/pull/5142
- @sroy745 made their first contribution in https://github.com/vllm-project/vllm/pull/5131
- @joerunde made their first contribution in https://github.com/vllm-project/vllm/pull/5559
- @sergey-tinkoff made their first contribution in https://github.com/vllm-project/vllm/pull/5178
- @milo157 made their first contribution in https://github.com/vllm-project/vllm/pull/5553
- @ShukantPal made their first contribution in https://github.com/vllm-project/vllm/pull/5628
- @rafvasq made their first contribution in https://github.com/vllm-project/vllm/pull/5614
- @JRosenkranz made their first contribution in https://github.com/vllm-project/vllm/pull/4947
- @rohithkrn made their first contribution in https://github.com/vllm-project/vllm/pull/5603
- @wooyeonlee0 made their first contribution in https://github.com/vllm-project/vllm/pull/5414
- @aws-patlange made their first contribution in https://github.com/vllm-project/vllm/pull/5841
- @stephanie-wang made their first contribution in https://github.com/vllm-project/vllm/pull/5408
- @ProExpertProg made their first contribution in https://github.com/vllm-project/vllm/pull/5560
- @ChipKerchner made their first contribution in https://github.com/vllm-project/vllm/pull/5652
- @ilya-lavrenov made their first contribution in https://github.com/vllm-project/vllm/pull/5379
- @mcalman made their first contribution in https://github.com/vllm-project/vllm/pull/5926
- @SolitaryThinker made their first contribution in https://github.com/vllm-project/vllm/pull/5925
- @llmpros made their first contribution in https://github.com/vllm-project/vllm/pull/5935
- @avshalomman made their first contribution in https://github.com/vllm-project/vllm/pull/6029
- @danieljannai21 made their first contribution in https://github.com/vllm-project/vllm/pull/5709
- @sirejdua made their first contribution in https://github.com/vllm-project/vllm/pull/6050
- @gshtras made their first contribution in https://github.com/vllm-project/vllm/pull/6043
- @frittentheke made their first contribution in https://github.com/vllm-project/vllm/pull/6136
- @jvlunteren made their first contribution in https://github.com/vllm-project/vllm/pull/5742
- @JGSweets made their first contribution in https://github.com/vllm-project/vllm/pull/5695
Full Changelog: https://github.com/vllm-project/vllm/compare/v0.5.0...v0.5.1
1. vllm-0.5.1+cu118-cp310-cp310-manylinux1_x86_64.whl (140.54 MB)
2. vllm-0.5.1+cu118-cp311-cp311-manylinux1_x86_64.whl (140.54 MB)
3. vllm-0.5.1+cu118-cp38-cp38-manylinux1_x86_64.whl (140.54 MB)
4. vllm-0.5.1+cu118-cp39-cp39-manylinux1_x86_64.whl (140.54 MB)
5. vllm-0.5.1-cp310-cp310-manylinux1_x86_64.whl (140.1 MB)
6. vllm-0.5.1-cp311-cp311-manylinux1_x86_64.whl (140.1 MB)
7. vllm-0.5.1-cp38-cp38-manylinux1_x86_64.whl (140.1 MB)
8. vllm-0.5.1-cp39-cp39-manylinux1_x86_64.whl (140.1 MB)