
[Paper Reading] ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

Last Updated on 2024-08-24 by Clay

Introduction

ColBERT is an embedding model designed specifically for retrieval tasks: it transforms the tokens of queries and documents into embeddings and scores them with a maximum-similarity (MaxSim) operation.

As shown in the image, the cosine similarity is computed between every Query token (assuming there are N of them) and every Document token, and for each Query token only the highest similarity, its MaxSim score, is kept (N scores in total).

Summing these MaxSim scores then gives the Query's score against the Document. This scoring method is named Late Interaction.
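To make the scoring concrete, here is a minimal sketch of the MaxSim-then-sum computation; the function name and array shapes are my own illustration, not the paper's code:

```python
import numpy as np

def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """MaxSim scoring: for each query token take the highest cosine similarity
    against all document tokens, then sum those N maxima.

    query_emb: (N, m) L2-normalized query token embeddings
    doc_emb:   (M, m) L2-normalized document token embeddings
    """
    sim = query_emb @ doc_emb.T          # (N, M) cosine similarities (unit vectors)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, summed

# Toy usage: 4 query tokens, 6 document tokens, embedding dimension 8
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(6, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```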


Architecture

ColBERT contains two encoders, a Query Encoder and a Document Encoder, both based on BERT or a similar encoder-only Transformer architecture.

Naturally, the Query Encoder is used to transform the tokens of a Query, while the Document Encoder is used to transform the tokens of a Document.

This architecture is different from a cross-encoder, which combines the Query and Document into a single input sequence.


Query Encoder

Given a query, after tokenization a special symbol [Q] is inserted immediately to the right of the [CLS] token; in addition, the [MASK] special symbol is used to pad the query up to a predefined length Nq (if the token sequence exceeds Nq, it is truncated).

Each token's representation vector from BERT is then passed through a linear layer with no activation function, projecting it into the specified dimension m.
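As a rough sketch of this pipeline (assuming Hugging Face transformers; the choices of Nq = 32, m = 128, and "[unused0]" standing in for [Q] are my own placeholders, not necessarily what the official repo does):

```python
import torch
from transformers import BertModel, BertTokenizerFast

N_q, m = 32, 128  # illustrative values for the query length and output dimension
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
linear = torch.nn.Linear(bert.config.hidden_size, m, bias=False)  # no activation

def encode_query(query: str) -> torch.Tensor:
    # Tokenize, then insert the [Q] marker right after [CLS];
    # "[unused0]" is used here as a stand-in for the [Q] symbol.
    ids = tokenizer(query, add_special_tokens=True)["input_ids"]
    q_marker = tokenizer.convert_tokens_to_ids("[unused0]")
    ids = [ids[0], q_marker] + ids[1:]
    # Pad with [MASK] up to N_q tokens, truncating if the query is longer.
    ids = (ids + [tokenizer.mask_token_id] * N_q)[:N_q]
    with torch.no_grad():
        hidden = bert(torch.tensor([ids])).last_hidden_state      # (1, N_q, 768)
        emb = torch.nn.functional.normalize(linear(hidden), dim=-1)
    return emb[0]  # (N_q, m) unit-length token embeddings
```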


Document Encoder

The process is nearly identical to that of the Query Encoder, except that the special symbol inserted in front is [D] and there is no need for any padding.
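Continuing the same sketch, the document side differs only in the marker token and the absence of [MASK] padding ("[unused1]" standing in for [D] is again my own placeholder):

```python
def encode_document(document: str, max_len: int = 180) -> torch.Tensor:
    # Tokenize and insert the [D] marker after [CLS]; no [MASK] padding here.
    ids = tokenizer(document, add_special_tokens=True,
                    truncation=True, max_length=max_len)["input_ids"]
    d_marker = tokenizer.convert_tokens_to_ids("[unused1]")
    ids = [ids[0], d_marker] + ids[1:]
    with torch.no_grad():
        hidden = bert(torch.tensor([ids])).last_hidden_state
        emb = torch.nn.functional.normalize(linear(hidden), dim=-1)
    return emb[0]  # (number of document tokens, m)
```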

Late Interaction

With the two encoders, we can compute the token embeddings of the Query and the Document separately.

The actual formula for calculating Late Interaction is:
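E_q := Normalize(CNN(BERT("[Q] q_0 q_1 ... q_l # ... #")))
E_d := Filter(Normalize(CNN(BERT("[D] d_0 d_1 ... d_n"))))
S_{q,d} := \sum_{i \in [|E_q|]} \max_{j \in [|E_d|]} E_{q_i} \cdot E_{d_j}^{\top}

Here E_q and E_d are the matrices of Query and Document token embeddings, # denotes the [MASK] padding tokens, and the score S_{q,d} sums, over the Query tokens, the maximum dot product (i.e., cosine similarity, since the embeddings are normalized) against all Document tokens.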

Interestingly, the paper mentions that although RNNs or CNNs could also be used as encoders, the best results are still obtained with a Transformer-based encoder (i.e., BERT); it is therefore unclear to me why a CNN appears in the formula.

The Normalize step is necessary because the scores are cosine similarities; as for Filter, after some further digging I found that it refers to removing the embeddings of punctuation tokens from the Document.
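A minimal sketch of that filtering step (using Python's string.punctuation as the punctuation set is my own assumption):

```python
import string

def filter_punctuation(tokens, embeddings):
    # Keep only embeddings whose corresponding token is not a punctuation mark.
    keep = [i for i, tok in enumerate(tokens) if tok not in string.punctuation]
    return embeddings[keep]
```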

I have yet to review the GitHub repo implementation, as I plan to read through ColBERTv2 before doing so. It feels a bit like following a serial novel.


Conclusion

I will not go into ColBERT's effectiveness here; since this is work from 2020, it is more efficient to read the comparative evaluation in ColBERTv2 directly.

From a personal perspective, though, because ColBERT considers the maximum similarity between every token of the Query and the Document, it feels like a finer-grained similarity calculation, somewhat reminiscent of a cross-encoder.

Unlike a cross-encoder, however, ColBERT can compute Document embeddings offline, which, combined with the vector databases popular today (2024), greatly improves retrieval efficiency.

In short, I see ColBERT as another kind of cross-encoder, except that it processes the Query and Document separately and thereby gains the ability to do offline retrieval.


References

Omar Khattab and Matei Zaharia. "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT." SIGIR 2020. https://arxiv.org/abs/2004.12832



