In NLP, embedding usually refers to converting text into numerical values. After all, text is discrete data that a computer cannot process directly.
The following is just my personal understanding. For example, suppose we have this sentence:
Today is a nice day.
Then we can convert the words of this sentence to some indices.
| Word | Index |
| --- | --- |
| Today | 1 |
| is | 2 |
| a | 3 |
| nice | 4 |
| day | 5 |
Then, we can express this sentence in the following format:
[1, 2, 3, 4, 5]
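As a minimal sketch of that conversion (the whitespace tokenizer and the tiny vocabulary here are made up purely for illustration), the mapping can be built with a plain Python dictionary:

```python
sentence = "Today is a nice day"

# Toy vocabulary: each distinct word gets an index, starting from 1
# so that 0 can later be reserved for padding.
word2idx = {word: idx + 1 for idx, word in enumerate(sentence.split())}
# {'Today': 1, 'is': 2, 'a': 3, 'nice': 4, 'day': 5}

indices = [word2idx[word] for word in sentence.split()]
print(indices)  # [1, 2, 3, 4, 5]
```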
In this way, you can feed the data into a deep learning framework for training, but I suspect the results would be quite poor.
After all, this kind of conversion seems even inferior to one-hot encoding.
But I digress.
Today I want to record how to use the embedding layer in PyTorch to convert our text data into another numerical representation.
nn.Embedding of PyTorch
First, let's take a look at the official documentation.
nn.Embedding roughly has the following parameters:
- num_embeddings: the size of the vocabulary, i.e. the number of distinct word indices
- embedding_dim: the number of dimensions of the vector each word is converted into
- padding_idx: if given, the entry at this index is a zero vector and is not updated during training; it is typically used to pad sequences so that every input keeps the same length (a short sketch follows this list)
- max_norm: if given, each embedding vector whose norm exceeds max_norm is renormalized to have norm max_norm
- sparse: if True, the gradient with respect to the weight matrix will be a sparse tensor
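To make padding_idx concrete, here is a minimal sketch (the vocabulary size and sequence lengths are made up): index 0 is reserved for padding, so its vector is all zeros and never gets updated.

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=3, padding_idx=0)

# Two sequences padded with index 0 so that both have length 5.
padded_batch = torch.tensor([[1, 2, 3, 0, 0],
                             [4, 5, 0, 0, 0]])

print(embedding(padded_batch).shape)  # torch.Size([2, 5, 3])
print(embedding(padded_batch)[0, 3])  # a padded position: tensor([0., 0., 0.], ...)
```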
Basically, num_embeddings and embedding_dim are the two most important parameters. Suppose we have the following program:
```python
import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 3, padding_idx=0)

inputs = torch.tensor([1, 2, 3, 4, 5])
print(embedding(inputs))
```
Output:
```
tensor([[-0.3296,  0.6558, -1.4805],
        [-0.1493, -0.5477,  0.6713],
        [ 0.4637,  1.3709,  0.2004],
        [ 0.2457, -1.4567, -0.4856],
        [-0.9163,  0.6130, -1.1636]], grad_fn=<EmbeddingBackward>)
```
Each vocabulary entry (each index) is converted into a vector, and the size of this vector is the embedding_dim we set.
In general, we train the model for our task while the Embedding layer's vectors are updated along with it, which improves the results.
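As a rough sketch of that joint training (the classifier head, the fake data, and the hyper-parameters below are invented purely for illustration):

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Embedding layer followed by a linear classifier over the averaged word vectors."""
    def __init__(self, vocab_size=1000, embedding_dim=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        emb = self.embedding(x)          # (batch, seq_len, embedding_dim)
        return self.fc(emb.mean(dim=1))  # average over the sequence

model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.tensor([[1, 2, 3, 4, 5]])  # one fake sentence
labels = torch.tensor([1])                # one fake label

loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()  # the embedding vectors are updated together with the classifier
```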
In addition to letting the Embedding layer learn from scratch, you can also initialize it with pre-trained vectors from models such as GloVe or Word2Vec, so that the Embedding layer starts fine-tuning from a better starting point.
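In PyTorch this can be done with nn.Embedding.from_pretrained; in the sketch below the weight matrix is just a random stand-in for real GloVe/Word2Vec vectors, which you would normally load from a file:

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained (vocab_size x embedding_dim) weight matrix;
# in practice this would be loaded from GloVe/Word2Vec files.
pretrained_weights = torch.randn(1000, 3)

# freeze=False lets the vectors keep being fine-tuned during training;
# freeze=True would keep them fixed.
embedding = nn.Embedding.from_pretrained(pretrained_weights, freeze=False)

inputs = torch.tensor([1, 2, 3, 4, 5])
print(embedding(inputs).shape)  # torch.Size([5, 3])
```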
References
- https://pytorch.org/docs/stable/nn.html?highlight=embeddingbag#torch.nn.EmbeddingBag
- https://discuss.pytorch.org/t/how-should-i-understand-the-num-embeddings-and-embedding-dim-arguments-for-nn-embedding/60442