
[PyTorch] How to Use HuggingFace Transformers Package (With BERT Example)

Last Updated on 2021-10-27 by Clay

At the end of 2018, the Transformer-based model BERT swept the leaderboards of major NLP competitions with impressive results. I have been interested in Transformer models such as BERT for a while, so today I want to record how to use the transformers package developed by HuggingFace.

This article focuses less on the principles of the Transformer model, and more on how to use the transformers package.

Although this package also has a TensorFlow version, this note will use PyTorch for the demonstration.


Transformer Model

First of all, BERT is a type of Transformer model, but what exactly is a Transformer model? Basically, the Transformer architecture can be divided into an Encoder and a Decoder.

However, since BERT is a language model that converts words (tokens) into feature representations, what we usually use is actually just the Encoder part of the Transformer model.
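
To see this concretely, the BERT checkpoint that ships with transformers is an encoder-only stack. Here is a minimal sketch (assuming the bert-base-uncased checkpoint) that inspects its configuration:

# coding: utf-8
from transformers import AutoConfig

# Load the configuration of the pre-trained BERT model
config = AutoConfig.from_pretrained('bert-base-uncased')

# bert-base-uncased is an encoder-only stack:
# 12 Transformer encoder layers with hidden size 768
print('num_hidden_layers:', config.num_hidden_layers)
print('hidden_size:', config.hidden_size)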


How to use BERT

I just briefly explained that BERT is a type of Transformer model; now let me introduce what BERT actually is.

BERT (Bidirectional Encoder Representations from Transformers) was proposed in a paper published by Google researchers, which showed that a bidirectionally trained language model outperforms a unidirectional one.

So how do we use BERT for our downstream tasks?

First, we need to install the transformers package developed by the HuggingFace team:

pip3 install transformers


If neither PyTorch nor TensorFlow is installed in your environment, the transformers package may crash (for example with a core dump) when you use it, so I recommend installing at least one of them first.
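
You can quickly check whether PyTorch is already available in your environment like this:

python3 -c "import torch; print(torch.__version__)"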

To use BERT to convert words into feature representations, we need to convert the words into indices and pad the sentences to the same length.

This is the sample code:

# coding: utf-8
import torch
from transformers import AutoTokenizer, AutoModel
from keras.preprocessing.sequence import pad_sequences


# Tokenizer and Bert Model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
embedding = AutoModel.from_pretrained('bert-base-uncased')


# Preprocess
sent = 'Today is a nice day'

# encode() tokenizes the sentence and adds the [CLS] and [SEP] indices
sent_token = tokenizer.encode(sent)

# Pad the sequence to a fixed length of 10 with trailing zeros
sent_token_padding = pad_sequences([sent_token], maxlen=10, padding='post', dtype='int')

# Attention mask: 1.0 for real tokens, 0.0 for padding (BERT's [PAD] id is 0)
masks = [[float(value > 0) for value in values] for values in sent_token_padding]

print('sent:', sent)
print('sent_token:', sent_token)
print('sent_token_padding:', sent_token_padding)
print('masks:', masks)
print('\n')


# Convert the inputs and masks to tensors and feed them into BERT
inputs = torch.tensor(sent_token_padding)
masks = torch.tensor(masks)
embedded = embedding(inputs, attention_mask=masks)
print('embedded shape:', embedded[0].shape)


Output:

sent: Today is a nice day
sent_token: [101, 2651, 2003, 1037, 3835, 2154, 102]
sent_token_padding: [[ 101 2651 2003 1037 3835 2154  102    0    0    0]]
masks: [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]]


embedded shape: torch.Size([1, 10, 768])


First of all, we need to initialize the tokenizer and the model; here we select the pre-trained model bert-base-uncased.

Then I use tokenizer.encode() to encode my sentence into the indices required by BERT. Each index corresponds to a token, with [CLS] at the left and [SEP] at the right; this is the input format BERT requires.

After all the data has been converted to torch.tensor type, we feed it into the embedding variable (which is the BERT model) to get the final output.
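
By the way, if your transformers version is fairly recent, you can skip keras.preprocessing entirely: calling the tokenizer directly handles padding and builds the attention mask for you. A minimal sketch of this alternative:

# coding: utf-8
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
embedding = AutoModel.from_pretrained('bert-base-uncased')

# The tokenizer call pads to max_length and builds the attention mask itself
encoded = tokenizer('Today is a nice day',
                    padding='max_length',
                    max_length=10,
                    return_tensors='pt')

embedded = embedding(encoded['input_ids'], attention_mask=encoded['attention_mask'])
print('embedded shape:', embedded[0].shape)  # torch.Size([1, 10, 768])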

In addition to using encode(), you can also use convert_tokens_to_ids() to do the conversion. convert_tokens_to_ids() lets us tokenize the text first and handle the special symbols ourselves, adding [CLS] at the start and using [SEP] as the separator.

The following is a simple example:

# coding: utf-8
import torch
from transformers import AutoTokenizer, AutoModel
from keras.preprocessing.sequence import pad_sequences


# Tokenizer and Bert Model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
embedding = AutoModel.from_pretrained('bert-base-uncased')


# Preprocess
sent = 'Today is a nice day'
sent_token = ['[CLS]'] + tokenizer.tokenize(sent) + ['[SEP]']
sent_token_encode = tokenizer.convert_tokens_to_ids(sent_token)
sent_token_decode = tokenizer.convert_ids_to_tokens(sent_token_encode)

print('sent:', sent)
print('sent_token:', sent_token)
print('encode:', sent_token_encode)
print('decode:', sent_token_decode)


Output:

sent: Today is a nice day
sent_token: ['[CLS]', 'today', 'is', 'a', 'nice', 'day', '[SEP]']
encode: [101, 2651, 2003, 1037, 3835, 2154, 102]
decode: ['[CLS]', 'today', 'is', 'a', 'nice', 'day', '[SEP]']


In addition to encoding, you can also decode the indices back to the original tokens.
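
If you want a plain string rather than a token list, the tokenizer also provides decode(). Continuing from the code above:

decoded_string = tokenizer.decode(sent_token_encode)
print(decoded_string)

Output:

[CLS] today is a nice day [SEP]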

This is the basic usage of the transformers package. If you want to try different models, you can refer to the following URL: https://huggingface.co/models
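
Since the code above uses the Auto* classes, trying another model is usually just a matter of swapping the checkpoint name. For example, to load DistilBERT instead of BERT:

# coding: utf-8
from transformers import AutoTokenizer, AutoModel

# Any checkpoint name from https://huggingface.co/models can be used here
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')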

