Using vLLM To Accelerate The Decoding Of Large Language Models
Last Updated on 2023-12-14 by Clay
Introduction
vLLM is a large language model (LLM) inference acceleration framework developed by a research team at the University of California, Berkeley. It uses PagedAttention to improve GPU VRAM utilization, and this method does not require any changes to the model architecture.