
[Python] Convert a GloVe model to a format Gensim can read


Introduction

Anyone familiar with natural language processing (NLP) has probably come across GloVe and the Python package Gensim.

GloVe (Global Vectors for Word Representation) is a paper published by the Stanford NLP Group, and it is also an open-source set of pre-trained word embeddings. When people mention GloVe online today, they usually mean these pre-trained vectors.

Gensim is a Python library for topic modeling and word embeddings. It includes an implementation of Word2Vec, the model proposed by Google in 2013, and lets us easily train a word vector model on our own corpus.

Now for the topic of today's article: how do we use GloVe in Python? The GloVe files downloaded from the official website cannot be read by Gensim's word2vec loader as-is, because they lack the header line (vocabulary size and vector dimensions) that the word2vec text format expects.

So, we need to use a function built into Gensim to perform the conversion. The following is a step-by-step record of how to convert the GloVe model into a format that Gensim can read.
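To make the difference between the two formats concrete: a word2vec text file starts with a line holding the vocabulary size and the vector dimensions, while a GloVe text file starts straight away with the first word vector. The sketch below adds that header by hand using only the standard library. The helper name `add_word2vec_header` and the two-word sample file are made up for illustration; in practice Gensim's own conversion function does this job.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path, encoding="utf-8"):
    """Copy a GloVe text file, prepending a '<vocab_size> <dimensions>' line.

    Hypothetical helper for illustration; returns (vocab_size, dimensions).
    """
    # First pass: count the vectors and read the dimensionality
    # from the first line, which looks like "<token> <v1> ... <vN>".
    vocab_size = 0
    dimensions = 0
    with open(glove_path, encoding=encoding) as src:
        for line in src:
            if vocab_size == 0:
                dimensions = len(line.rstrip().split(" ")) - 1
            vocab_size += 1

    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding=encoding) as src, \
            open(output_path, "w", encoding=encoding) as dst:
        dst.write(f"{vocab_size} {dimensions}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dimensions


# Made-up sample (two words, three dimensions) so the sketch runs
# without the real GloVe download.
sample = "cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n"
with tempfile.TemporaryDirectory() as tmp:
    glove_path = os.path.join(tmp, "tiny_glove.txt")
    output_path = os.path.join(tmp, "tiny_word2vec.txt")
    with open(glove_path, "w", encoding="utf-8") as f:
        f.write(sample)
    header = add_word2vec_header(glove_path, output_path)
    with open(output_path, encoding="utf-8") as f:
        converted = f.read()

print(header)
print(converted.splitlines()[0])
```

The only change to the file is the new first line; every vector line is copied through untouched.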


Using Gensim to convert the GloVe model

First, we need to download GloVe: https://nlp.stanford.edu/projects/glove/

Or you can use the following command:

wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
rm glove.6B.zip

Once this finishes, you should see several GloVe files with different vector dimensions (50d, 100d, 200d and 300d).

Then we can use the following code to convert one of these models. If this is your first time using Gensim, install it first with pip3 install gensim.

# coding: utf-8
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec


# Convert
input_file = 'glove.6B.300d.txt'
output_file = 'gensim_glove.6B.300d.txt'
glove2word2vec(input_file, output_file)


# Test Glove model
model = KeyedVectors.load_word2vec_format(output_file, binary=False)
word = 'cat'
print(word)
print('Most similar:\n{}'.format(model.most_similar(word)))


Output:

cat
Most similar:
[('dog', 0.6816747188568115),
 ('cats', 0.6815836429595947),
 ('pet', 0.5870364904403687),
 ('dogs', 0.540766716003418),
 ('feline', 0.48979705572128296),
 ('monkey', 0.48794347047805786),
 ('horse', 0.4732130467891693),
 ('pets', 0.4634858965873718),
 ('rabbit', 0.4608757495880127),
 ('leopard', 0.4585462808609009)]
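The numbers next to each word are cosine similarities between that word's vector and the vector for 'cat': values near 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated. A minimal sketch of the computation, using made-up 3-dimensional vectors rather than real GloVe vectors:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors, purely for illustration.
cat = [1.0, 2.0, 0.0]
dog = [2.0, 1.0, 0.0]
car = [0.0, 0.0, 3.0]

print(cosine_similarity(cat, dog))  # close to 0.8: similar direction
print(cosine_similarity(cat, car))  # 0.0: orthogonal vectors
```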

As we can see, Gensim can now read the GloVe model.

