Papers

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Clay
2024-11-16
AI, Machine Learning, Papers

Highlights of This Paper

Quantization, pruning, and distillation can also accelerate models, but come with issues like changes in output distribution compared to the original model, as well as the cost of retraining.
The original Speculative Decoding faces the issue of requiring additional memory to run the draft model, whereas Self-Speculative Decoding uses part of its own neural network as the draft model.
The Adaptive Draft-Exiting Mechanism can automatically adjust the number of tokens predicted by the draft model based on confidence score thresholds.
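
The confidence-threshold idea in the last highlight can be illustrated with a minimal sketch. This is not the paper's algorithm: the function name `draft_length`, the fixed `threshold`, and the choice to stop before the first low-confidence token are all assumptions made for illustration, and the automatic threshold adjustment is left out.

```python
def draft_length(confidences, threshold, max_draft):
    """Count how many draft tokens are kept before drafting stops.

    confidences: the draft model's confidence for each candidate token, in order
                 (hypothetical values supplied by the caller).
    threshold:   minimum confidence needed to keep drafting (assumed fixed here).
    max_draft:   hard cap on the number of drafted tokens.
    """
    kept = 0
    for p in confidences[:max_draft]:
        if p < threshold:
            # Low confidence: stop drafting and hand over to verification.
            break
        kept += 1
    return kept


# Made-up confidence scores: drafting stops at the third token (0.4 < 0.6).
n = draft_length([0.9, 0.8, 0.4, 0.95], threshold=0.6, max_draft=8)
```

With a fixed threshold the draft simply gets shorter when the draft model is unsure and longer when it is confident, which is the behaviour the highlight describes.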

Clay
2024-11-06
AI, Machine Learning, Papers

Abstract

In auto-regressive decoding, generating K tokens requires K serial runs of the model, which is the current bottleneck in the inference time of large language models.
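
A toy loop makes the serial dependency concrete. The names `decode` and `toy_next_token` are hypothetical, and the "model" is a stand-in function rather than a real language model; the point is only that each new token needs its own call that depends on all previous tokens.

```python
def decode(next_token, prompt, k):
    """Generate k tokens one at a time and count the model calls."""
    tokens = list(prompt)
    calls = 0
    for _ in range(k):
        # Each step needs the full sequence so far, so steps cannot run in parallel.
        tokens.append(next_token(tokens))
        calls += 1
    return tokens, calls


def toy_next_token(tokens):
    """Stand-in for a model forward pass: returns the last token plus one."""
    return tokens[-1] + 1


tokens, calls = decode(toy_next_token, prompt=[0], k=4)
```

Here `calls` always equals `k`, which is the cost that speculative decoding tries to reduce.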

Clay
2024-10-16
AI, Machine Learning, Papers

The following are some points in this paper:

Clay
2024-08-24
Machine Learning

Introduction

ColBERT is an embedding model designed specifically for retrieval tasks, transforming the tokens of Queries and Documents into embeddings and computing the maximum similarity.
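
The maximum-similarity scoring can be sketched in a few lines of plain Python. The embeddings below are made-up two-dimensional vectors, and the function names are hypothetical; a real ColBERT model produces the token embeddings with BERT and typically scores with optimized tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, doc_embs):
    """Late-interaction score: for each query token embedding, take its best
    similarity against all document token embeddings, then sum over the query."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy embeddings (assumed values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
document = [[1.0, 0.0], [0.5, 0.5]]
score = maxsim_score(query, document)
```

Because each query token is matched independently, document embeddings can be computed ahead of time and only this cheap comparison runs at query time.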

Clay
2024-08-13 (updated 2024-08-19)
AI, Machine Learning

Cross-lingual Modular (X-Mod) is an interesting language model architecture that modularizes the parameters for different languages as Module Units, allowing the model to use separate parameters when fine-tuning for a new language, thereby (comparatively) avoiding the problem of catastrophic forgetting.

Clay
2024-08-10
Machine Learning, PyTorch

Introduction

The year 2023 witnessed an explosion of generative AI technologies, with a myriad of applications emerging across various domains. In the field of Natural Language Processing (NLP), Large Language Models (LLMs) stand out as one of the most significant advancements. When LLMs are trained effectively and their hallucinations are reduced, they can significantly cut human effort across a wide range of tasks.

Clay
2024-07-21 (updated 2024-07-25)
Machine Learning

Introduction

Mistral 7B is a large language model (LLM) released on September 27, 2023 by the Mistral AI team, which also published its weights openly. Notably, it uses the highly permissive Apache 2.0 license, unlike Llama 2, which ships under its own Llama license terms. In that sense, Mistral 7B is truly “open source” (Llama 2’s license requires requesting a separate license from Meta once a service exceeds 700 million monthly active users).

Clay
2024-06-03 (updated 2024-07-25)
Machine Learning, Python

Introduction

This acceleration framework, proposed by Huawei Noah’s Ark Lab, replaces the small model used in the original speculative decoding with a shallow sub-network of the large model. It also uses an additionally trained adapter together with the model’s own decoding head to generate speculative tokens, which are then verified by the large model. The subsequent steps are quite similar to the original speculative decoding process.

Clay
2024-01-22 (updated 2024-07-25)
Machine Learning

Introduction

Retrieval-Augmented Generation (RAG) is a well-known architecture for using Large Language Models (LLMs) today. It uses “retrieval” to supply the model with knowledge it lacked during training, enabling it to answer questions in the context of specific information.

Clay
2024-01-21 (updated 2024-07-25)
Machine Learning

Introduction

The wave of large models has been unstoppable since the release of ChatGPT in November 2022. Since then, the scale of open-source Large Language Models (LLMs) has kept increasing, with LLaMA-2-70B and Falcon-180B as two examples.

Papers

[Paper Reading] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Highlights of This Paper

[Paper Reading] Fast Inference from Transformers via Speculative Decoding

Abstract

[Paper Reading] ENTP: ENCODER-ONLY NEXT TOKEN PREDICTION

[Paper Reading] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Introduction

[Paper Reading] Lifting the Curse of Multilinguality by Pre-training Modular Transformers

[Paper Reading] RAGAS: Automated Evaluation of Retrieval Augmented Generation

Introduction

[Paper Reading] Mistral 7B

Introduction

[Paper Reading] Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting

Introduction

[Paper Reading] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Introduction

[Paper Reading] QLoRA: Efficient Finetuning of Quantized LLMs

Introduction