Last Updated on 2021-01-02 by Clay
Introduction
Anyone familiar with natural language processing (NLP) has probably come across GloVe and the Python package Gensim.
GloVe (Global Vectors for Word Representation) is a paper published by the Stanford NLP Group, and it is also a set of open-source pre-trained word embeddings. When people mention GloVe online today, they usually mean these pre-trained vectors.
Gensim is a Python package that includes an implementation of Word2Vec, the model proposed by Google in 2013, so we can easily train word vectors on our own corpus with it.
That brings us to the topic of today's article: how do we use GloVe in Python? The GloVe files downloaded from the official website cannot be read by Gensim directly, because they lack the header line (vocabulary size and vector dimension) that the word2vec text format expects.
So we need to use a built-in Gensim function to perform the conversion. The following is a step-by-step record of how to convert the GloVe model into a format that Gensim can read.
Using Gensim to convert the GloVe model
First, we need to download GloVe: https://nlp.stanford.edu/projects/glove/
Alternatively, you can download and unpack it with the following commands:
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
rm glove.6B.zip
When that finishes, you should see several GloVe files with different vector dimensions.
Next, we can use the following code to convert one of these models. If you are using the Gensim package for the first time, install it first with pip3 install gensim.
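Before converting, it can help to check what one of these files actually contains. The sketch below is a small, hypothetical helper (`describe_glove_file` is not part of GloVe or Gensim) that counts the vectors and infers the dimension from the first line. It assumes each line is a token followed by space-separated numbers and that tokens themselves contain no spaces.

```python
import os


def describe_glove_file(path):
    """Return (number of vectors, vector dimension) for a GloVe text file."""
    count = 0
    dim = None
    with open(path, encoding='utf-8') as f:
        for line in f:
            if dim is None:
                # First field is the token, the rest are the vector values
                dim = len(line.rstrip().split(' ')) - 1
            count += 1
    return count, dim


# Only inspect the file if it has already been downloaded and unzipped
if os.path.exists('glove.6B.300d.txt'):
    print(describe_glove_file('glove.6B.300d.txt'))
```

These two numbers are exactly what the word2vec text format expects on its first line, and what the GloVe files leave out.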
# coding: utf-8
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Convert the GloVe text file to the word2vec text format,
# which starts with a "<vocab size> <dimension>" header line
input_file = 'glove.6B.300d.txt'
output_file = 'gensim_glove.6B.300d.txt'
glove2word2vec(input_file, output_file)

# Load the converted file and test it with a similarity query
model = KeyedVectors.load_word2vec_format(output_file, binary=False)
word = 'cat'
print(word)
print('Most similar:\n{}'.format(model.most_similar(word)))
Output:
cat
Most similar:
[('dog', 0.6816747188568115),
('cats', 0.6815836429595947),
('pet', 0.5870364904403687),
('dogs', 0.540766716003418),
('feline', 0.48979705572128296),
('monkey', 0.48794347047805786),
('horse', 0.4732130467891693),
('pets', 0.4634858965873718),
('rabbit', 0.4608757495880127),
('leopard', 0.4585462808609009)]
As we can see, the GloVe model can now be read by Gensim.
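To see what the conversion amounts to, here is a rough sketch of the same idea in plain Python: count the vectors, infer the dimension, and write a "<count> <dim>" header in front of the unchanged GloVe lines. This is an illustration, not Gensim's actual implementation; the function name `add_word2vec_header` is hypothetical, and it assumes tokens contain no whitespace.

```python
import shutil


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file to output_path, prefixed with a word2vec header."""
    # First pass: count the vectors and infer the dimension from line one
    with open(glove_path, encoding='utf-8') as src:
        first_line = src.readline()
        if not first_line.strip():
            raise ValueError('empty GloVe file: {}'.format(glove_path))
        dim = len(first_line.split()) - 1
        count = 1 + sum(1 for _ in src)

    # Second pass: write "<count> <dim>", then the original lines unchanged
    with open(glove_path, encoding='utf-8') as src, \
            open(output_path, 'w', encoding='utf-8') as dst:
        dst.write('{} {}\n'.format(count, dim))
        shutil.copyfileobj(src, dst)
    return count, dim
```

Newer Gensim releases (4.0 and later, as far as I know) also accept no_header=True in KeyedVectors.load_word2vec_format, which should let you load the original GloVe file without writing a converted copy. Check the documentation of your installed version before relying on it.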
References
- https://www.aclweb.org/anthology/D14-1162/
- https://nlp.stanford.edu/projects/glove/
- https://code.google.com/archive/p/word2vec/