
[NLP] Use aitextgen Python Package To Generate Text

Last Updated on 2021-12-31 by Clay

aitextgen is a great Python package that lets you configure a complex AI text-generation model with only a few lines of code. It is built on OpenAI's GPT-2 and EleutherAI's GPT Neo/GPT-3 architectures, and it can load a pre-trained model for fine-tuning.

Below, I briefly introduce how to use the aitextgen package.


Installation

First, install aitextgen with pip:

pip3 install aitextgen


If running the code below raises a missing-dependency error, install the packages named in the error message.
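
Before running anything heavier, it can help to check that the package and its main dependencies can be found. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of aitextgen, and the list of packages checked is an assumption about what a typical install pulls in.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the import system can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Assumed package list: aitextgen plus the libraries it is built on.
    for name in ("aitextgen", "torch", "transformers"):
        if is_installed(name):
            print(f"{name}: found")
        else:
            print(f"{name}: missing (try: pip3 install {name})")
```
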


Generation

The following sample code generates text.

# coding: utf-8
from aitextgen import aitextgen


def main():
    ai = aitextgen()
    ai.generate(n=1, prompt="The dog", max_length=100)


if __name__ == "__main__":
    main()


Output:

The dog was given a free medical attention certificate and is now in foster care.

"I think it was a wonderful experience for me," said Deacon. "I really don't want to spend too much time thinking about my dog in the same way.

"The dogs are so wonderful. They are the perfect companions and they are going to be a great part of my life.

"We have a great relationship and I'm very happy about the fact we are going to


Although the content is a bit strange, most of the generated text is grammatically correct.

Of course, many parameters can be adjusted.

Generation Parameters

  • n: Number of texts to generate
  • max_length: Maximum length of the generated text, in tokens (default=200; GPT-2 supports up to 1024; GPT-neo up to 2048)
  • prompt: Text that the generation starts from
  • temperature: Controls how "crazy" (random) the text is (default: 0.7)
  • num_beams: If greater than 1, performs beam search to generate cleaner text
  • repetition_penalty: If greater than 1.0, penalizes repetition in the text to avoid infinite loops
  • length_penalty: If greater than 1.0, penalizes text that is too long
  • no_repeat_ngram_size: Token length of phrases that must not be repeated
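
To build some intuition for the temperature parameter, here is a small sketch of temperature scaling as it is commonly applied in sampling-based text generation. It is not aitextgen's own code: the function name and the toy scores are made up for illustration. Dividing the scores by a small temperature concentrates probability on the top candidate, while a larger temperature spreads it out.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for three candidate next tokens (made-up numbers).
toy_logits = [2.0, 1.0, 0.1]

for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])
```
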


Generation Functions

Here we assume that the name of the aitextgen object is ai:

(To use a GPU, call ai.to_gpu() or create the object with ai = aitextgen(to_gpu=True).)

  • ai.generate(): Generate and print out
  • ai.generate_one(): Generate a single text and return it as a string
  • ai.generate_samples(): Generate multiple samples at specified temperatures
  • ai.generate_to_file(): Generate large amounts of text and save to files
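
Because ai.generate_one() returns a string rather than printing, it is convenient when you want to collect results in your own code. The sketch below compares outputs at several temperatures; the helper `sample_at_temperatures` is a hypothetical name, and it only assumes that `ai` has a `generate_one()` method accepting the generation parameters listed earlier.

```python
def sample_at_temperatures(ai, prompt, temperatures, max_length=50):
    """Return {temperature: generated_text} using ai.generate_one()."""
    results = {}
    for t in temperatures:
        results[t] = ai.generate_one(
            prompt=prompt, temperature=t, max_length=max_length
        )
    return results


# Usage, assuming aitextgen is installed and a model has been loaded:
#   from aitextgen import aitextgen
#   ai = aitextgen()
#   for t, text in sample_at_temperatures(ai, "The dog", [0.5, 0.7, 1.0]).items():
#       print(f"--- temperature={t} ---")
#       print(text)
```
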

Load the model

By default, aitextgen loads the GPT-2 model with 124M parameters.

To use a different model, pass its name:

ai = aitextgen(model="EleutherAI/gpt-neo-125M")


Other available models are listed on the Hugging Face website: https://huggingface.co/EleutherAI


Train the model

To train a model, follow the official tutorial: download the Shakespeare text and use the following sample code:

from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen

# The name of the downloaded Shakespeare text for training
file_name = "input.txt"

# Train a custom BPE Tokenizer on the downloaded text
# This will save one file: `aitextgen.tokenizer.json`, which contains the
# information needed to rebuild the tokenizer.
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"

# GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU-training
# e.g. the # of input tokens here is 64 vs. 1024 for base GPT-2.
config = GPT2ConfigCPU()

# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# You can build datasets for training by creating TokenDatasets,
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)

# Train the model! It will save pytorch_model.bin periodically and after completion to the `trained_model` folder.
# On a 2020 8-core iMac, this took ~25 minutes to run.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)

# Generate text from it!
ai.generate(10, prompt="ROMEO:")

# With your trained model, you can reload the model at any time by
# providing the folder containing the pytorch_model.bin model weights + the config, and providing the tokenizer.
ai2 = aitextgen(model_folder="trained_model",
                tokenizer_file="aitextgen.tokenizer.json")

ai2.generate(10, prompt="ROMEO:")



Save the model

Saving the model is simple:

ai.save()
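
If you would rather keep the saved files in a dedicated folder, a small wrapper like the one below may help. This is a sketch under an assumption: that ai.save() accepts a target folder as its first argument, so please check this against your installed aitextgen version. The helper name `save_to_folder` is hypothetical.

```python
from pathlib import Path


def save_to_folder(ai, folder):
    """Create `folder` if needed, save the model there, and return the path."""
    target = Path(folder)
    target.mkdir(parents=True, exist_ok=True)
    # Assumption: ai.save() takes the destination folder as its first argument.
    ai.save(str(target))
    return target


# Usage, assuming `ai` is a loaded or trained aitextgen object:
#   save_to_folder(ai, "my_model")
#   ai2 = aitextgen(model_folder="my_model")
# A model trained with a custom tokenizer also needs tokenizer_file=... on reload.
```
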


