Last Updated on 2021-10-13 by Clay
If you are using PyTorch to process NLP tasks, you must be familiar with nn.Embedding() in PyTorch.
nn.Embedding() is an embedding layer in PyTorch that lets us feed in word indices and get back vectors whose dimension we can specify arbitrarily.
After converting from text to vectors, we can start training our model. After all, a computer is a device that can only process numbers.
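For example, here is a minimal sketch of nn.Embedding() on its own (the vocabulary size of 10 and dimension of 3 are made up just for illustration):
# coding: utf-8
import torch
import torch.nn as nn

# A randomly initialized embedding layer: 10 words in the vocabulary, 3-dimensional vectors
embedding = nn.Embedding(num_embeddings=10, embedding_dim=3)

# Feed in word indices and get back their vectors
word_ids = torch.tensor([0, 2, 5])
print(embedding(word_ids).shape)  # torch.Size([3, 3])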
Gensim provides a Python implementation of Word2Vec, which Google published in 2013, and lets us train a pre-trained model that converts text into vectors through CBOW or skip-gram. As far as I know, using such a pre-trained model works better in most tasks than initializing nn.Embedding() directly.
The purpose of this article is to record "how to use nn.Embedding() to directly load Gensim's pre-trained model weights". In this way, the pre-trained model gives us a better starting point, and we can fine-tune it for whatever task we are dealing with, hopefully ending up with a better model.
How To Use nn.Embedding() To Load Gensim Model Weights
First, we need a pre-trained Gensim model. The following assumes that word2vec_pretrain_v300.model is the pre-trained model.
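If you do not already have such a model, the following is only a rough sketch of how one could be trained and saved with Gensim (the toy corpus here is made up; in practice you would feed in your own tokenized sentences, and note that in Gensim 4.x the size argument was renamed to vector_size):
# coding: utf-8
from gensim.models import Word2Vec

# A made-up toy corpus: a list of tokenized sentences
sentences = [
    ['今天', '天氣', '很', '好'],
    ['明天', '天氣', '不', '好'],
]

# size=300 matches the 300-dimensional vectors used in this article
# sg=0 trains with CBOW, sg=1 trains with skip-gram
model = Word2Vec(sentences, size=300, sg=0, min_count=1)
model.save('./word2vec_pretrain_v300.model')

With the model ready, we can load its weights into nn.Embedding():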
# coding: utf-8
import gensim
import torch
import torch.nn as nn
# Load the word2vec pre-trained model
model = gensim.models.Word2Vec.load('./word2vec_pretrain_v300.model')
weights = torch.FloatTensor(model.wv.vectors)
# Build nn.Embedding() layer
embedding = nn.Embedding.from_pretrained(weights)
embedding.weight.requires_grad = False
# Query
query = '天氣'
query_id = torch.tensor(model.wv.vocab[query].index)
gensim_vector = torch.tensor(model.wv[query])
embedding_vector = embedding(query_id)
print(gensim_vector == embedding_vector)
Output:
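If everything is loaded correctly, the comparison should print a tensor of 300 True values, since both lookups return exactly the same row of the weight matrix.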
First, load Gensim's pre-trained model and convert its vectors into the Tensor format that PyTorch requires; these serve as the initial weights of nn.Embedding().
There is a small tip: if you don't plan to train nn.Embedding() together with the rest of the model, remember to freeze its weights. from_pretrained() already does this by default (freeze=True); equivalently, you can set embedding.weight.requires_grad = False.
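For example, either of the following (just a sketch) keeps the embedding weights fixed during training:
# from_pretrained() freezes the weights by default
embedding = nn.Embedding.from_pretrained(weights, freeze=True)

# Equivalently, freeze the weight tensor yourself
embedding.weight.requires_grad = False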
The remaining steps are easy.
Extract the word's index from the Gensim pre-trained model's vocabulary, convert it to a Tensor, and feed it into nn.Embedding() to get the trained 300-dimensional vector.
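In practice we usually look up a whole sequence of word indices at once. Here is a small sketch that reuses the model and embedding objects above, assuming every word is in the vocabulary:
# Convert a tokenized sentence into word indices, then embed the whole sequence at once
sentence = ['今天', '天氣']
sentence_ids = torch.tensor([model.wv.vocab[word].index for word in sentence])
sentence_vectors = embedding(sentence_ids)
print(sentence_vectors.shape)  # torch.Size([2, 300])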
Finally, I did a small experiment to confirm that the vector of "天氣" (weather) returned by the Gensim model and the one returned by nn.Embedding() are the same.
In addition, when actually using the nn.Embedding layer, you still have to pay attention to so-called "unknown words". nn.Embedding will not handle them automatically, so we need to manually append an extra unknown-word vector to the loaded weights, deciding for ourselves whether to use an average vector or a zero vector. Then, when encoding the vocabulary, any word that is not in the pre-trained vocabulary is mapped to the index of that unknown-word vector.
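Here is a minimal sketch of one way to do this, appending a zero vector as the unknown-word vector (the unk_id convention is just an assumption for illustration):
# Append one extra row for unknown words (a zero vector here; an average vector also works)
unk_vector = torch.zeros(1, weights.shape[1])
weights_with_unk = torch.cat([weights, unk_vector], dim=0)
embedding = nn.Embedding.from_pretrained(weights_with_unk)

# Words not in the pre-trained vocabulary get mapped to the unknown-word index
unk_id = weights_with_unk.shape[0] - 1
word = '不存在的詞'
word_id = model.wv.vocab[word].index if word in model.wv.vocab else unk_id
print(embedding(torch.tensor(word_id)))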