
[Machine Learning] Vector Quantization (VQ) Notes

Last Updated on 2024-10-03 by Clay

The first time I heard about Vector Quantization (VQ) was from a friend who was working on audio processing, which gave me a vague understanding that VQ is a technique used for data feature compression and representation. At that time, I still wasn't clear on how it differed from dimensionality reduction techniques like PCA.

However, yesterday, when I came across the Emu3 architecture proposed by BAAI, I was amazed by their idea of generating all modalities (video, images, text) with the same token units, without relying on separate image-generation architectures (such as Stable Diffusion or GANs). It is a multimodal model built purely on a Transformer that generates images and text by predicting tokens. I became deeply interested, so I explored how the smallest patches of an image are encoded and decoded, and eventually traced it down to the concept of vector quantization.

To be honest, although I consider myself well-versed in various model architectures as an AI engineer, there are still many techniques I am not familiar with or fully understand. The only solution is to keep learning diligently.

Simply put, vector quantization is widely used in deep learning. Beyond audio, it also appears in natural language processing and computer vision. Concretely, vector quantization maps high-dimensional data onto a finite, discrete codebook, which reduces storage and computational cost.

At first, I was a bit puzzled about why the emphasis is on "discrete." But on reflection, a codebook works like a dictionary: the entries that high-dimensional data get mapped to are finite, predefined, and independent of one another.
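
To make this concrete, here is a minimal sketch (using a made-up two-dimensional codebook and input vectors, not anything from a real model) of what quantization looks like in practice: each input vector is replaced by the index of its nearest codebook entry, and reconstruction is just a table lookup.

import numpy as np


# A hypothetical codebook with 4 code vectors in 2D
codebook = np.array([
    [0.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [1.0, 0.0],
])

# Input vectors to quantize
x = np.array([
    [0.1, 0.2],
    [0.9, 0.8],
])

# Encode: find the index of the nearest codebook entry for each input vector
distances = np.linalg.norm(x[:, None, :] - codebook[None, :, :], axis=-1)
indices = distances.argmin(axis=1)

# Decode: replace each index with its code vector (a lossy reconstruction)
x_hat = codebook[indices]

print(indices)  # [0 1]
print(x_hat)    # [[0. 0.] [1. 1.]]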


Hands-On Testing

The simplest way to test it is with K-means clustering, where each cluster center serves as an entry stored in the codebook.

First, let's generate four random clusters of data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs


# Generate dataset
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=2999)

# Plot original data
plt.scatter(X[:, 0], X[:, 1], s=30, color="blue", label="Original Data")
plt.title("Original Data")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()


Output: (scatter plot titled "Original Data")


Next, use K-means to find the cluster centers and plot the results.

from sklearn.cluster import KMeans


k = 4
kmeans = KMeans(n_clusters=k, random_state=2999)
kmeans.fit(X)


# Get codebook
codebook = kmeans.cluster_centers_

# Get labels
labels = kmeans.labels_

# Plot
plt.scatter(X[:, 0], X[:, 1], c=labels, s=30, cmap="viridis", label="Quantized Data")
plt.scatter(codebook[:, 0], codebook[:, 1], s=200, c="red", marker="X", label="Codebook")

plt.title("Vector Quantization using K-Means")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()


Output: (scatter plot titled "Vector Quantization using K-Means", with the codebook centers marked as red crosses)
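
With the fitted model, the quantization itself can be made explicit: each point is encoded as a single integer index into the codebook and decoded by looking that index up again. The sketch below reuses the kmeans and codebook objects from above and measures the reconstruction error, which is the price paid for the compression.

# Encode: each sample is reduced to one integer index into the codebook
indices = kmeans.predict(X)

# Decode: look the indices up in the codebook (lossy reconstruction)
X_hat = codebook[indices]

# Quantization error: mean squared distance between samples and their code vectors
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(f"Mean squared quantization error: {mse:.4f}")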


It seems plausible that the recurring patterns, orientations, and layouts in images can indeed be covered by a codebook of around 30,000 entries for a single 16x16 patch. In other words, the detail of a 16x16 patch can be represented by one of roughly 30,000 learned code vectors.
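
As a rough back-of-the-envelope illustration (the patch and codebook sizes below are assumptions chosen for the arithmetic, not the actual Emu3 numbers), replacing a raw 16x16 RGB patch with a single index into a codebook of that size cuts the storage from thousands of bits to a handful:

import math


# Hypothetical figures for illustration only
raw_bits = 16 * 16 * 3 * 8  # a raw 16x16 RGB patch at 8 bits per channel: 6144 bits
codebook_size = 32768       # an assumed codebook size on the order of "30,000"
index_bits = math.ceil(math.log2(codebook_size))  # 15 bits for one codebook index

print(raw_bits, index_bits)  # 6144 15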

Of course, mapping from a continuous high-dimensional space onto a finite codebook is inevitably a lossy compression. After looking into some of the developments around VQ, I feel I still have a lot to learn before I fully understand the Emu architecture, especially since its codebook is learned and its features are far more complex than the two-dimensional toy dataset used here.

