Papers

[Paper Reading] Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding

Most of the time in LLM inference is spent generating tokens sequentially, which exposes a GPU memory-bandwidth bottleneck: for every single token decoded, all of the model’s weights must be loaded from memory, even though the floating-point computation per token is small. As a result, the GPU’s computational capacity is underutilized.
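To make the bandwidth bottleneck concrete, here is a rough back-of-the-envelope sketch of the ceiling it puts on decoding speed. The numbers (a 7B-parameter model in fp16, a GPU with roughly 2 TB/s of memory bandwidth) are illustrative assumptions, not figures from the paper.

```python
# Rough estimate of the memory-bandwidth ceiling on sequential decoding.
# Assumed, illustrative numbers: 7B parameters in fp16 on a GPU with
# ~2 TB/s of memory bandwidth.
params = 7e9
bytes_per_param = 2                        # fp16
weight_bytes = params * bytes_per_param    # ~14 GB read per decoded token
bandwidth_bytes_per_s = 2e12               # ~2 TB/s

# If every decoded token requires streaming all weights from memory once,
# decoding speed is capped by bandwidth / model size.
max_tokens_per_s = bandwidth_bytes_per_s / weight_bytes
print(f"upper bound: ~{max_tokens_per_s:.0f} tokens/s")
```

With these assumptions the ceiling is on the order of 140 tokens/s, regardless of how much idle compute the GPU has, which is exactly the underutilization speculative-decoding methods like Hydra try to exploit.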

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Highlights of This Paper

  • Quantization, pruning, and distillation can also accelerate inference, but they change the output distribution relative to the original model and incur retraining costs.
  • The original Speculative Decoding requires extra memory to run a separate draft model, whereas Self-Speculative Decoding reuses part of the model’s own network (by skipping some layers) as the draft model.
  • The Adaptive Draft-Exiting Mechanism automatically adjusts the number of tokens the draft model predicts, based on a confidence-score threshold.
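The draft-exiting idea in the last bullet can be sketched in a few lines: keep drafting tokens while the draft model's confidence stays above a threshold, and stop early once it drops. The `confidence_fn` callback and the threshold value below are hypothetical stand-ins, not the paper's actual interface.

```python
def draft_tokens(confidence_fn, threshold=0.8, max_draft=8):
    """Draft speculative tokens until confidence drops below the threshold.

    confidence_fn(step) is a hypothetical stand-in returning
    (token, confidence) from the draft model at each step; in the paper,
    the draft model is a subset of the full model's own layers.
    """
    drafted = []
    for step in range(max_draft):
        token, conf = confidence_fn(step)
        if conf < threshold:
            break  # adaptive exit: stop drafting on low confidence
        drafted.append(token)
    return drafted


# Usage with deterministic fake confidences: drafting stops at step 2,
# where confidence (0.7) falls below the 0.8 threshold.
def fake_draft(step):
    confs = [0.95, 0.90, 0.70, 0.99]
    return f"tok{step}", confs[step]

print(draft_tokens(fake_draft))  # ['tok0', 'tok1']
```

Raising the threshold makes drafting more conservative (fewer tokens per round, but a higher acceptance rate at verification); lowering it does the opposite, which is the trade-off the adaptive mechanism tunes.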