
[PyTorch] Use “Embedding” Layer To Process Text

In NLP, embedding usually refers to converting text into numerical values. After all, text is not numerical data, and a computer cannot process it directly.

The following is just my personal understanding. For example, suppose we have the sentence:

Today is a nice day.

Then we can convert the words of this sentence to some indices.

Today → 1
is → 2
a → 3
nice → 4
day → 5

Just like this, each word turns into a number.

Then, we can express this sentence in the following format:

[1, 2, 3, 4, 5]
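
Here is a minimal sketch of this kind of conversion in plain Python (the vocabulary and the start index 1 are just my own choices for illustration):

sentence = "Today is a nice day"

# Assign each word an index, starting from 1 (0 could be reserved for padding)
word2idx = {word: idx for idx, word in enumerate(sentence.split(), start=1)}
indices = [word2idx[word] for word in sentence.split()]
print(indices)

Output:

[1, 2, 3, 4, 5]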

In this way, you could already feed the data to a deep learning framework for training, but I guess the result would be quite poor.

After all, this kind of conversion seems even inferior to one-hot encoding.
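
For comparison, a one-hot encoding of those same indices might look like this (just a quick sketch using torch.nn.functional.one_hot):

import torch
import torch.nn.functional as F

indices = torch.tensor([1, 2, 3, 4, 5])
print(F.one_hot(indices, num_classes=6))

Output:

tensor([[0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 1]])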

But I digress.

Today I want to record how to use the embedding layer in the PyTorch framework to convert our text data into numerical data.


nn.Embedding in PyTorch

First, let's take a look at the official documentation.

nn.Embedding roughly has the following parameters:

  • num_embeddings: The size of the vocabulary, i.e. the number of distinct indices
  • embedding_dim: The number of dimensions each word (index) should be converted into
  • padding_idx: If given, this index is used for padding (to keep every input the same length); its vector is initialized to zeros and is not updated during training
  • max_norm: If given, each embedding vector whose norm exceeds max_norm is renormalized to max_norm
  • sparse: If True, the gradient with respect to the weight matrix will be a sparse tensor

Basically, num_embeddings and embedding_dim are the two most important parameters. Suppose we have the following program:

import torch
import torch.nn as nn

embedding = nn.Embedding(1000, 3, padding_idx=0)  # 1000 indices, 3-dimensional vectors
inputs = torch.tensor([1, 2, 3, 4, 5])
print(embedding(inputs))



Output:

tensor([[-0.3296,  0.6558, -1.4805],
        [-0.1493, -0.5477,  0.6713],
        [ 0.4637,  1.3709,  0.2004],
        [ 0.2457, -1.4567, -0.4856],
        [-0.9163,  0.6130, -1.1636]], grad_fn=<EmbeddingBackward>)

We get a vector for each vocabulary item (each index), and the size of each vector is the embedding_dim we set.
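
By the way, since we set padding_idx=0 above, index 0 maps to an all-zero vector that never gets updated during training. A quick check, continuing from the code above:

print(embedding(torch.tensor([0])))

Output:

tensor([[0., 0., 0.]], grad_fn=<EmbeddingBackward>)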

In this way, we generally train the embedding layer together with the model of our task, so its conversion vectors are improved along the way and the overall effect gets better.
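
As a rough sketch of what that joint training looks like, here is a tiny, hypothetical classification model in which the embedding weights receive gradients together with the rest of the network:

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size=1000, embedding_dim=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        # Average the word vectors of each sentence, then classify
        return self.fc(self.embedding(x).mean(dim=1))

model = TinyClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

inputs = torch.tensor([[1, 2, 3, 4, 5]])  # a batch containing one sentence
labels = torch.tensor([1])

loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()  # the embedding vectors are updated together with the classifier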

In addition to letting the embedding layer learn from scratch, you can also initialize it with pre-trained vectors from models such as GloVe or Word2Vec, so that the embedding layer starts from a better starting point.
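
A minimal sketch of that kind of initialization with nn.Embedding.from_pretrained (a random matrix stands in here for real GloVe or Word2Vec weights):

import torch
import torch.nn as nn

# Stand-in for a real pre-trained weight matrix of shape (vocab_size, embedding_dim)
pretrained_weights = torch.randn(1000, 3)

# freeze=False lets the vectors keep fine-tuning during training
embedding = nn.Embedding.from_pretrained(pretrained_weights, freeze=False)
print(embedding(torch.tensor([1, 2, 3])))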

