
ImageBind: Experience Notes on a Multimodal Vector Transformation Model

Introduction

Meta AI has been on an incredible run lately, seemingly cementing its position as a giant in AI research and development in no time at all, and setting the bar high with one top-tier open-source release after another. From Segment Anything, which can segment objects in the image domain, to LLaMA, the openly released large language and foundation model (yes, the one that spawned the whole llama family of models!), to the recent ImageBind, which handles six modalities, and the Massively Multilingual Speech (MMS) project… I must say, for an ordinary person like me, just keeping up with how to use these technologies is quite an effort, let alone trying to match their technical prowess.

As I haven’t fully read the paper “IMAGEBIND: One Embedding Space To Bind Them All,” and have only skimmed it, I won’t delve into the details of its principles today. Hopefully, when I have some spare time in the future, I can document my impressions and thoughts after reading it.

Today, the main focus is how to use the ImageBind model that Meta AI has open-sourced. Their official GitHub already provides clear instructions, of course, so I will simply supplement them with a few minor pitfalls I stumbled into along the way.


What makes ImageBind special

Firstly, ImageBind is a model that transforms data into vectors, or embeddings if you prefer. It covers six modalities: text, vision, audio, depth, thermal, and inertial measurement unit (IMU, which you can think of as motion signals). Essentially, apart from text (which uses CLIP’s text encoder; see the source code at https://github.com/facebookresearch/ImageBind/blob/main/models/imagebind_model.py), every other modality is converted into an image-like representation for a Vision Transformer (ViT) to process.

Even audio is handled this way: it is converted into a spectrogram before being fed to the ViT. However, this is not the reason the model is named ImageBind.
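As a rough illustration of what “turning audio into an image” means, here is a minimal sketch that computes a log-mel spectrogram with torchaudio. The repo’s data.load_and_transform_audio_data does something in this spirit, but the sample rate, number of mel bins, and file name below are my own assumptions rather than the official preprocessing.

import torch
import torchaudio

# Hypothetical parameters; the official preprocessing in data.py differs in detail.
SAMPLE_RATE = 16000
N_MELS = 128

waveform, sr = torchaudio.load("cat_meow.wav")  # placeholder audio file
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)

# Turn the 1-D waveform into a 2-D "image": mel bins on one axis, time frames on the other.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=SAMPLE_RATE, n_mels=N_MELS)(waveform)
log_mel = torch.log(mel + 1e-6)  # log scale, as is common for spectrogram features

print(log_mel.shape)  # e.g. (1, 128, num_frames): a single-channel image a ViT can patchify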

In practice, we find that if different modalities describe the same object, their vector representations tend to be very similar. For instance, the word “cat” and an image of a cat will be close in vector space, and so will an image of a cat and the sound of a cat meowing!

So how does ImageBind achieve this vector alignment across modalities? I naively assumed at first that Meta AI had prepared paired data for every combination of modalities and run contrastive learning on each pairing: pulling matching pairs closer in vector space and pushing mismatched pairs apart.

I later discovered that ImageBind essentially trains on pairs of the form (Image, M), where the image side is fixed and M is one of the other five modalities, each paired against images via contrastive learning. Concretely, every modality’s encoder outputs an embedding of the same dimension in its final layer, and after normalization the model is trained with the InfoNCE loss.

Image source from: https://arxiv.org/abs/2305.05665
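The figure is just the loss function itself; written out, the InfoNCE objective looks roughly like this (reconstructed from my reading of Equation 1 of the paper, so treat the exact indexing with a little caution):

L_{I,M} = -\log \frac{\exp\left(q_i^{\top} k_i / \tau\right)}{\exp\left(q_i^{\top} k_i / \tau\right) + \sum_{j \neq i} \exp\left(q_i^{\top} k_j / \tau\right)}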

Here, q is the normalized embedding of the image, k is the normalized embedding of the corresponding sample from the other modality in the pair, and τ is the temperature hyperparameter that controls the smoothness of the softmax.

After this type of training, amazingly, “cat” and the “sound of a cat meowing” become extremely close entities in vector space, i.e., they have been aligned!

On reflection, this isn’t really surprising; it’s actually quite reasonable: because “cat” and an “image of a cat” are very close in vector space, and the “sound of a cat meowing” is also very close to an “image of a cat”, doesn’t that mean “cat” and the “sound of a cat meowing” must also be close?

That is to say, the “image of a cat” serves as an anchor point, binding vectors of different modalities but similar themes together!
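To convince myself that this transitivity really holds numerically, here is a toy NumPy sketch (nothing to do with the real model): if two unit vectors are each pushed close to the same anchor, their similarity to each other ends up high as well, even though they were never compared directly.

import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

dim = 1024  # ImageBind-huge embeddings happen to be 1024-dimensional

# A toy "anchor" standing in for the embedding of a cat image.
anchor = unit(rng.normal(size=dim))

# Toy embeddings for the word "cat" and a meowing sound: each was only ever
# trained against the image, so model them as the anchor plus small independent noise.
text_cat = unit(anchor + 0.3 * unit(rng.normal(size=dim)))
audio_meow = unit(anchor + 0.3 * unit(rng.normal(size=dim)))

print("text  ~ image:", round(float(text_cat @ anchor), 3))
print("audio ~ image:", round(float(audio_meow @ anchor), 3))
print("text  ~ audio:", round(float(text_cat @ audio_meow), 3))  # also high, with no direct pairing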

Therefore, ImageBind is truly a powerful model, capable of generating vectors that can be used interchangeably across modalities.


How to use ImageBind?

If you want to use ImageBind, first, you need to clone the GitHub repo.

git clone https://github.com/facebookresearch/ImageBind.git


And then, we need to download the open-source ImageBind model.

mkdir ImageBind/.checkpoints
cd ImageBind/.checkpoints

wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth


The next step is the most troublesome: installing the packages. I strongly recommend using conda, as I ran into a slew of errors when I tried to install cartopy directly with pip. The solution that finally worked for me came from https://github.com/SciTools/cartopy/issues/1940, which is to install it with conda install.

If you’ve followed the steps up to this point, we should now be in the .checkpoints folder. First, use cd ../ to go back up one level and confirm that the requirements.txt file exists in the current directory. Only then do we start the installation process.

conda create --name imagebind python=3.8 -y
conda activate imagebind

pip install -r requirements.txt
conda install -c conda-forge cartopy



If pip errors out while installing the cartopy package, fall back to installing it with conda install as shown above. After that, the model should function normally. If other packages fail to install, you may need to search online for solutions.

Given that there are six modalities, it’s unsurprising that the preprocessing packages required are substantially more numerous than those for a single modality.

Finally, we are ready to begin testing the model! Of course, we need to be within the project folder in order to call upon the pre-written code modules provided by the official team.

import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

text_list=["A dog.", "A car", "A bird"]
image_paths=[".assets/dog_image.jpg", ".assets/car_image.jpg", ".assets/bird_image.jpg"]
audio_paths=[".assets/dog_audio.wav", ".assets/car_audio.wav", ".assets/bird_audio.wav"]

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Instantiate model
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

print(
    "Vision x Text: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Audio x Text: ",
    torch.softmax(embeddings[ModalityType.AUDIO] @ embeddings[ModalityType.TEXT].T, dim=-1),
)
print(
    "Vision x Audio: ",
    torch.softmax(embeddings[ModalityType.VISION] @ embeddings[ModalityType.AUDIO].T, dim=-1),
)

# Expected output:
#
# Vision x Text:
# tensor([[9.9761e-01, 2.3694e-03, 1.8612e-05],
#         [3.3836e-05, 9.9994e-01, 2.4118e-05],
#         [4.7997e-05, 1.3496e-02, 9.8646e-01]])
#
# Audio x Text:
# tensor([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]])
#
# Vision x Audio:
# tensor([[0.8070, 0.1088, 0.0842],
#         [0.1036, 0.7884, 0.1079],
#         [0.0018, 0.0022, 0.9960]])


Running this script directly should give you results similar, if not identical, to the official example. In real applications, we can naturally swap in our own text, images, and so on to get the embeddings we want.

Interestingly, if you look closely, you’ll find that the similarity scores for matching topics are remarkably high, which shows off ImageBind’s capabilities nicely. Of course, I didn’t just run the official examples and conclude that the model is good; I also took a picture of my own precious cat, computed embeddings for the words “dog”, “cat”, and “dragon”, and compared the similarities.

In the end, “cat” got roughly 0.98 of the score, with “dog” and “dragon” splitting the remaining 0.02, and “dog” being somewhat more similar than “dragon”.
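For anyone who wants to repeat that little experiment, the comparison looks something like the sketch below. It reuses only the official helper functions from the example above; "my_cat.jpg" is a placeholder for your own photo, and the exact scores will of course depend on your picture.

import data
import torch
from models import imagebind_model
from models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

inputs = {
    ModalityType.TEXT: data.load_and_transform_text(["dog", "cat", "dragon"], device),
    ModalityType.VISION: data.load_and_transform_vision_data(["my_cat.jpg"], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Softmax over the three candidate words turns the similarities into pseudo-probabilities.
scores = torch.softmax(
    embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T, dim=-1
)
print(scores)  # in my run, "cat" took roughly 0.98 of the mass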


Epilogue

Of course, Meta AI’s page notes that this model cannot be used for commercial purposes, and that it is still an early-stage model, so occasional mistakes are inevitable; they hope the open-source community will bear this in mind when using it.

However, this model really sparks many intriguing ideas. For example, a colleague recently showed me the architecture of PandaGPT, which uses a linear layer to transform the ImageBind embeddings and simultaneously fine-tunes the language model with LoRA, giving a chatbot that could originally only chat with us in text the ability to handle other modalities (a rough sketch of the idea follows below).
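To make that idea concrete, here is a minimal sketch of such a bridge module, purely my own simplification rather than PandaGPT’s actual code: ImageBind-huge embeddings are 1024-dimensional, while the language-model hidden size below is a made-up placeholder.

import torch
import torch.nn as nn

IMAGEBIND_DIM = 1024   # embedding size of imagebind_huge
LLM_HIDDEN = 4096      # hypothetical hidden size of the chat model

class ModalityBridge(nn.Module):
    """Project an ImageBind embedding into the LLM's embedding space
    so it can be prepended to the token embeddings as an extra 'prefix' token."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMAGEBIND_DIM, LLM_HIDDEN)

    def forward(self, imagebind_emb: torch.Tensor) -> torch.Tensor:
        # (batch, 1024) -> (batch, 1, 4096)
        return self.proj(imagebind_emb).unsqueeze(1)

bridge = ModalityBridge()
fake_embedding = torch.randn(2, IMAGEBIND_DIM)  # stand-in for the output of model(inputs)
prefix_tokens = bridge(fake_embedding)
print(prefix_tokens.shape)  # torch.Size([2, 1, 4096])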

This application is really fascinating! Moreover, on further thought, perhaps not only language models could be adjusted in this way – other types of models should be feasible too, right?

Therefore, there must be many more exciting applications like this!


References

IMAGEBIND: One Embedding Space To Bind Them All (https://arxiv.org/abs/2305.05665)
ImageBind official GitHub repository (https://github.com/facebookresearch/ImageBind)