Last Updated on 2021-03-07 by Clay
Introduction
“Word embedding” is a technique often used in natural language processing (NLP); its core idea is to convert text into a numerical format (numbers).
Why do we need such a conversion?
For example, a neural network cannot compute weights from raw “text”, so we need to convert the text into “numbers” that a computer can work with.
There are many ways to build a word embedding model, so you need to consider your task and hardware when making a choice.
The types we see most often are probably the following:
- One-hot encoding
- Word2Vec
- Doc2Vec
- GloVe
- FastText
- ELMo
- GPT
- BERT
I will briefly introduce what these are.
One-hot encoding
One-hot encoding was the first method I learned for converting text to numbers. Its concept is very simple.
Each word is represented by a vector with a 1 at its own index and 0 everywhere else. For example:
Today is a nice day
This sentence has 5 words:
['Today', 'is', 'a', 'nice', 'day']
So we can convert them to the following format:
| Word  | One-hot vector  |
|-------|-----------------|
| Today | [1, 0, 0, 0, 0] |
| is    | [0, 1, 0, 0, 0] |
| a     | [0, 0, 1, 0, 0] |
| nice  | [0, 0, 0, 1, 0] |
| day   | [0, 0, 0, 0, 1] |
This is a simple approach, but with a large vocabulary the vectors become huge and memory usage explodes.
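To make the idea concrete, here is a minimal sketch in plain Python (the `one_hot` helper is just something I made up for illustration):

```python
# A minimal one-hot encoding sketch for the example sentence above.
words = ['Today', 'is', 'a', 'nice', 'day']

def one_hot(word, vocab):
    # Build a vector with a 1 at the word's index and 0 everywhere else.
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

for word in words:
    print(word, one_hot(word, words))

# Today [1, 0, 0, 0, 0]
# is [0, 1, 0, 0, 0]
# a [0, 0, 1, 0, 0]
# nice [0, 0, 0, 1, 0]
# day [0, 0, 0, 0, 1]
```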
Word2Vec
Paper link: https://arxiv.org/pdf/1301.3781.pdf
Word2Vec is an open source tool developed by Google, and a well-known Python implementation is provided by the Gensim package. It offers two training algorithms: skip-gram and CBOW.
It runs on CPU only, yet training is fast. Because it maps each word to a low-dimensional space (converting a word into a dense vector), it does not suffer from the memory problem of one-hot encoding.
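As a quick illustration, a minimal training sketch with Gensim might look like this (the toy corpus and parameter values are only for demonstration; recent Gensim versions call the dimension parameter `vector_size`, while older versions use `size`):

```python
from gensim.models import Word2Vec

# A tiny toy corpus: each sentence is a list of tokens.
sentences = [
    ['today', 'is', 'a', 'nice', 'day'],
    ['today', 'is', 'a', 'bad', 'day'],
]

# sg=1 selects skip-gram; sg=0 (the default) selects CBOW.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# Look up the trained vector of a word.
print(model.wv['today'])
```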
But I think future tasks will tend to be solved with Transformer encoder models (like BERT or XLNet).
Doc2Vec
Paper link: https://cs.stanford.edu/~quocle/paragraph_vector.pdf
Doc2Vec is similar to Word2Vec: if Word2Vec converts a word into a vector, then Doc2Vec converts a whole document into a vector.
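A minimal Gensim sketch might look like this (the toy documents and tags are just assumptions; `model.dv` is the name in Gensim 4.x, older versions use `model.docvecs`):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document gets a tag so its vector can be looked up later.
documents = [
    TaggedDocument(words=['today', 'is', 'a', 'nice', 'day'], tags=['doc_0']),
    TaggedDocument(words=['today', 'is', 'a', 'bad', 'day'], tags=['doc_1']),
]

model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=20)

# Vector of a training document, and a vector inferred for new text.
print(model.dv['doc_0'])
print(model.infer_vector(['what', 'a', 'nice', 'day']))
```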
GloVe
GloVe is a famous tool developed by a Stanford group; you can visit their official website: https://nlp.stanford.edu/projects/glove/
GloVe is another way to obtain word vectors. Most pre-trained English word embedding models can be downloaded from: https://github.com/stanfordnlp/GloVe
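The pre-trained vectors are distributed as plain text files, so a small loading sketch could look like this (assuming you have downloaded a file such as glove.6B.100d.txt from the links above):

```python
import numpy as np

def load_glove(path):
    # Load a GloVe text file into a {word: vector} dictionary.
    embeddings = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            embeddings[parts[0]] = np.asarray(parts[1:], dtype='float32')
    return embeddings

glove = load_glove('glove.6B.100d.txt')
print(glove['nice'])
```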
FastText
FastText is a tool open-sourced by Facebook. In addition to this NLP library, Facebook has also open-sourced some other machine learning libraries.
The fastest way to learn this tool is to follow the guide on the official website: https://fasttext.cc/
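As a quick taste, a minimal sketch with the official fasttext Python package might look like this (data.txt is a hypothetical plain-text corpus file, one piece of text per line):

```python
import fasttext  # pip install fasttext

# Train word vectors on a plain-text corpus (hypothetical file path).
model = fasttext.train_unsupervised('data.txt', model='skipgram')

# Thanks to subword information, fastText can build vectors even for rare words.
print(model.get_word_vector('nice'))
print(model.get_nearest_neighbors('nice'))
```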
ELMo
ELMo was published in March 2018. Unlike Word2Vec, ELMo gives a word different vectors depending on its context, which makes it more useful than many fixed word vector models.
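A small sketch using the older allennlp package (around version 0.9; the API has since changed) might look like this; the point is that the word "bank" receives different vectors in the two sentences:

```python
# Uses the older allennlp API (around 0.9.x); newer releases changed it.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pre-trained English weights

# The same word "bank" appears in two different contexts.
vectors_river = elmo.embed_sentence(['We', 'walked', 'along', 'the', 'river', 'bank'])
vectors_money = elmo.embed_sentence(['I', 'deposited', 'money', 'in', 'the', 'bank'])

# Each result has shape (3 layers, number of tokens, 1024), and the
# vectors for "bank" differ between the two sentences.
print(vectors_river.shape, vectors_money.shape)
```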
GPT
BERT
I want to discuss GPT (or GPT-2) and BERT together.
BERT is a model published by Google in November 2018. The full name is Bidirectional Encoder Representations from Transformers. As the name suggests, it is built from the Transformer architecture.
In fact, BERT can be used for word embedding tasks: it converts a word into a vector, playing the role of the ENCODER in the Transformer architecture.
GPT-2's output is a word, or rather A TOKEN, so it plays the role of the DECODER in the Transformer.
If you want to use these Transformer-based tools, I recommend the useful “transformers” package (developed by the HuggingFace team): https://github.com/huggingface/transformers
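For example, a minimal sketch for extracting contextual BERT vectors with a recent version of transformers (plus PyTorch) might look like this; the model name and example sentence are only for demonstration:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize a sentence and run it through the BERT encoder.
inputs = tokenizer('Today is a nice day', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: shape (batch, tokens, 768) for bert-base.
embeddings = outputs.last_hidden_state
print(embeddings.shape)
```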
Speaking of BERT, I can tell a short story.
I once used an LSTM with Word2Vec embeddings to classify the reviews in the classic IMDB dataset. Sadly, perhaps because I did not tune it properly, the highest accuracy I reached was only 87%.
Meanwhile, my friend used nothing but scikit-learn and reached 90% accuracy! That is really unfair.
Annoyed, I tried using BERT as the embedding model in my classification task. A magical thing happened: my accuracy rose to 93.59%!
As you can see, it is really a very powerful tool.
That is a rough record of basic word embedding, including some of my past experience and the characteristics of each of these different embedding methods.
Of course, implementation is the most important part. I will continue to learn, practice different models, and take on different machine learning tasks in the future.