Last Updated on 2024-07-25 by Clay
Introduction
Mistral 7B is a large language model (LLM) introduced on September 27, 2023 by the Mistral AI team, which also released its weights openly. Notably, it uses the highly permissive Apache 2.0 license, unlike Llama 2, which ships under Meta's own Llama license terms. In that sense, Mistral 7B is truly “open source” (Llama 2's license requires obtaining a separate license from Meta AI once a service exceeds 700 million monthly active users).
The research team proudly announced that Mistral exhibits the following groundbreaking performance:
- Surpasses Llama 2 13B in all benchmarks
- Outperforms Llama 1 34B in many benchmarks
- Approaches CodeLlama 7B in programming capabilities while remaining strong at English tasks
- Uses Grouped-query Attention (GQA) to accelerate inference
- Employs Sliding Window Attention (SWA) to handle longer sequences with less overhead
As of late 2023 (November), some of the best-known 7B models, such as zephyr-7b-beta and OpenHermes-2.5-Mistral-7B, are also fine-tuned versions of Mistral. This shows the considerable potential of Mistral as a 7B-scale model, making it suitable for many generative tasks.
Architecture
Mistral is based on the Transformer architecture and is an auto-regressive model: it uses the tokens generated so far to predict the next token, repeating until the sequence is complete.
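For intuition, here is a minimal sketch of such a decoding loop in PyTorch (my own simplified code, assuming a Hugging Face-style model whose forward pass returns `.logits`; this is not Mistral's actual generation code):

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids, eos_token_id, max_new_tokens=64):
    """Minimal auto-regressive loop: repeatedly feed the sequence back in,
    append the most likely next token, and stop at EOS or the length limit."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits              # (batch, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if (next_token == eos_token_id).all():
            break
    return input_ids
```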
Mistral’s noteworthy architectural features include Sliding Window Attention (SWA), Rolling Buffer Cache, and Pre-fill Chunking.
Below, I will briefly explain my understanding of these features.
Sliding Window Attention (SWA)
Although Sliding Window Attention was not first proposed by Mistral, it is an interesting architectural choice. Because the sliding window has a fixed length, attention over a long context costs O(n*w) (where w is the window size) instead of the O(n^2) of the original self-attention mechanism.
In other words, during generation, if our KV tensors have shape (batch_size, num_head, seq_len, head_size), the seq_len of the cached keys and values is capped at the window size. Tokens that fall outside the window no longer participate in the attention computation.
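As a rough illustration (my own sketch, not Mistral's official implementation), a causal mask restricted to a sliding window can be built like this; the window size here is arbitrary:

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask where query position i may attend to key positions j
    satisfying i - window < j <= i (causal + sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.int())
# Each row has at most 3 True entries, so attention cost grows as O(n * w).
```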
For me, this highly efficient attention mechanism still raises concerns, such as how well it scales to longer contexts and how it holds up in “needle in a haystack” tests (or even the more recent “star capture” style tests as of April 2024).
If Sliding Window Attention proves itself on these fronts, I have no doubt it will become a staple of the next generation of Transformer architectures.
Additionally, since Mistral employs the Grouped-Query Attention (GQA) mechanism, the number of KV heads is smaller than the number of query heads (several query heads share one KV head), which reduces both the K/V projection parameters and the size of the KV cache.
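Below is a minimal sketch of the grouped-query idea (my own simplified version of the `repeat_kv`-style helper seen in implementations such as Hugging Face transformers; 32 query heads sharing 8 KV heads matches Mistral 7B's configuration, but treat the numbers here as an example):

```python
import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand (batch, n_kv_heads, seq_len, head_dim) so that each KV head
    is shared by n_rep query heads."""
    batch, n_kv_heads, seq_len, head_dim = x.shape
    x = x[:, :, None, :, :].expand(batch, n_kv_heads, n_rep, seq_len, head_dim)
    return x.reshape(batch, n_kv_heads * n_rep, seq_len, head_dim)

# Example: 32 query heads share 8 KV heads, so each KV head serves 4 queries.
k = torch.randn(1, 8, 10, 128)
k_expanded = repeat_kv(k, n_rep=32 // 8)
print(k_expanded.shape)  # torch.Size([1, 32, 10, 128])
```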
The Mistral development team also emphasizes that even with a fixed attention window, the information a token can draw on is not limited to that single window. Because the model stacks many layers, hidden states in later layers indirectly incorporate information from tokens much further back: each layer extends the receptive field by one window, which is what the Effective Context Length diagram illustrates. With a 4,096-token window and 32 layers, for example, the theoretical attention span at the last layer is roughly 32 x 4,096, or about 131K tokens.
Rolling Buffer Cache
Due to the use of Sliding Window Attention, the data stored in the KV Cache can also be fixed in size, hence the concept of Rolling Buffer Cache.
In simple terms, unlike a traditional KV Cache that grows with the sequence length, Sliding Window Attention puts an upper bound on how much needs to be cached (the window size W), so the cache can be updated in a rolling fashion: the keys and values for position i are written to slot i mod W, overwriting the oldest entry.
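Here is a minimal sketch of that ring-buffer idea (my own hypothetical class, ignoring batching and the position bookkeeping a real implementation needs):

```python
import torch

class RollingKVCache:
    """Fixed-size KV cache: the entry for absolute position i lives at
    slot i % window, so the oldest entries get overwritten."""
    def __init__(self, window: int, n_kv_heads: int, head_dim: int):
        self.window = window
        self.k = torch.zeros(window, n_kv_heads, head_dim)
        self.v = torch.zeros(window, n_kv_heads, head_dim)

    def update(self, pos: int, k_new: torch.Tensor, v_new: torch.Tensor):
        slot = pos % self.window           # rolling (ring-buffer) index
        self.k[slot] = k_new
        self.v[slot] = v_new

cache = RollingKVCache(window=4, n_kv_heads=8, head_dim=128)
for pos in range(10):                      # only positions 6-9 remain cached
    cache.update(pos, torch.randn(8, 128), torch.randn(8, 128))
```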
Pre-fill and Chunking
Pre-fill basically means filling the KV Cache in advance: unlike generated tokens, the prompt is known up front, so its keys and values can be computed and cached before decoding begins. When the prompt is very long, it is split into chunks of the window size and the cache is filled chunk by chunk; that is the chunking part.
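A rough sketch of the chunking step as I understand it (the function names are hypothetical; `forward_fn` stands in for whatever actually runs the model and fills the KV cache):

```python
def chunked_prefill(prompt_ids, window, forward_fn):
    """Split a long prompt into window-sized chunks and feed each chunk through
    forward_fn, which is assumed to compute that chunk's keys/values and write
    them into the (rolling) KV cache before decoding starts."""
    for start in range(0, len(prompt_ids), window):
        chunk = prompt_ids[start:start + window]
        forward_fn(chunk)

# e.g. a 10,000-token prompt with a 4,096-token window is processed as three
# chunks of sizes 4096, 4096 and 1808
chunked_prefill(list(range(10_000)), window=4_096, forward_fn=lambda chunk: None)
```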
Interestingly, although Mistral’s architectural design has its merits, it does not employ particularly novel or magical methods. The model’s excellent performance has led many researchers to assume that the quality of its training data is exceptionally high.
According to a certain configuration file, it is likely that the training data scale is about 8T tokens. This suggests that data quality might play a more critical role than model architecture and tuning.
Performance
The experimental results in this technical report reveal the strong performance of Mistral. Subsequent LLMs fine-tuned based on Mistral also exhibit excellent performance, such as HuggingFace’s Zephyr and NousResearch’s OpenHermes, both of which I have used extensively for testing and research.
Reflections
I started writing this article back in November 2023 and have been adding to it intermittently through February 2024 (and now it is April 2024). Although Mistral-7B has not been entirely outdated or superseded, the Mistral AI team's subsequent Mixtral 8x7B has been so impressive that it sparked a wave of MoE architectures in the open-source community, setting new benchmarks and even catching up with or surpassing GPT-3.5.
Later, Mistral Large also demonstrated remarkable performance, but unfortunately it was not open-sourced, which was quite disappointing.
Recently, I studied the source code in HuggingFace’s transformers library and tried to implement Mistral’s architecture myself, gaining some valuable insights and a deeper understanding of the Mistral model’s design. I hope to share my code in the future.
During the replication of Mistral’s architecture, I also implemented RoPE. You can refer to my notes: [Machine Learning] Rotary Position Embedding (RoPE) Notes