Last Updated on 2021-12-31 by Clay
aitextgen is a great Python package: it lets users configure complex AI models with just a few lines of code. Its architecture is based on OpenAI's GPT-2 and EleutherAI's GPT Neo/GPT-3, and it can load pre-trained models for fine-tuning.
Below, I will briefly introduce how to use the aitextgen package.
Installation
First, we need to use the pip tool to install aitextgen:
pip3 install aitextgen
If we get a dependency error when executing the code below, install the missing packages according to the error message.
Generation
The following is sample code for text generation.
# coding: utf-8
from aitextgen import aitextgen


def main():
    ai = aitextgen()
    ai.generate(n=1, prompt="The dog", max_length=100)


if __name__ == "__main__":
    main()
Output:
The dog was given a free medical attention certificate and is now in foster care.
"I think it was a wonderful experience for me," said Deacon. "I really don't want to spend too much time thinking about my dog in the same way.
"The dogs are so wonderful. They are the perfect companions and they are going to be a great part of my life.
"We have a great relationship and I'm very happy about the fact we are going to
Although the content is a bit strange, most of the generated text is grammatically correct.
Of course, we also have many parameters that can be adjusted.
Generation Parameters
n: Number of texts to generate
max_length: Length of the generated text (default: 200; GPT-2 supports up to 1024 tokens, GPT-Neo up to 2048)
prompt: The text with which generation starts
temperature: Controls how "crazy" the text is (default: 0.7)
num_beams: If greater than 1, performs beam search to produce cleaner text
repetition_penalty: If greater than 1.0, penalizes repetition in the text to avoid infinite loops
length_penalty: If greater than 1.0, penalizes text that is too long
no_repeat_ngram_size: Prevents any n-gram of the given size from repeating, avoiding duplicated short phrases
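The parameters above can be bundled into a single call. Below is a minimal sketch with illustrative values (not tuned recommendations); the actual generation call is shown commented out because constructing aitextgen() downloads the model:

```python
# Illustrative settings for aitextgen's generate() method.
gen_kwargs = {
    "n": 3,                     # generate three separate texts
    "prompt": "The dog",        # text that generation starts from
    "max_length": 100,          # cap the output length
    "temperature": 0.9,         # a bit more adventurous than the 0.7 default
    "num_beams": 1,             # > 1 would enable beam search
    "repetition_penalty": 1.2,  # > 1.0 discourages repeated phrases
    "no_repeat_ngram_size": 3,  # forbid repeating any 3-token sequence
}

# With an aitextgen instance (commented out to avoid the model download):
# from aitextgen import aitextgen
# ai = aitextgen()
# ai.generate(**gen_kwargs)
```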
Generation Functions
Here we assume that the aitextgen object is named ai:
(If you want to use the GPU, call ai.to_gpu() or create the object with ai = aitextgen(to_gpu=True).)
ai.generate(): Generates text and prints it
ai.generate_one(): Generates a single text and returns it as a string
ai.generate_samples(): Generates multiple samples at the specified temperatures
ai.generate_to_file(): Generates a large amount of text and saves it to a file
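Among these, generate_one() is the handy one for programmatic use, since it returns the text instead of printing it. Here is a small sketch of collecting several results; StubModel is a hypothetical stand-in for the real model so the helper can run without downloading GPT-2:

```python
class StubModel:
    """Hypothetical stand-in mimicking the one aitextgen method used below."""
    def generate_one(self, prompt=""):
        # A real model would return the prompt plus a generated continuation.
        return prompt + " ran across the field."

def collect_samples(model, prompts):
    # Works the same with a real instance: model = aitextgen()
    return [model.generate_one(prompt=p) for p in prompts]

samples = collect_samples(StubModel(), ["The dog", "The cat"])
print(len(samples))  # 2
```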
Load the model
By default, aitextgen loads GPT-2 with 124M parameters.
If you want to use another model, you can specify it:
ai = aitextgen(model="EleutherAI/gpt-neo-125M")
For other available models, you can browse the Hugging Face website: https://huggingface.co/EleutherAI
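A sketch of choosing the checkpoint and GPU flag in one place; the constructor call is commented out because it downloads the weights, and whether a GPU is available is an assumption:

```python
# Pick a checkpoint from the Hugging Face hub and decide on GPU usage up front.
model_name = "EleutherAI/gpt-neo-125M"  # smallest GPT-Neo checkpoint
use_gpu = False                         # set True if CUDA is available

# from aitextgen import aitextgen
# ai = aitextgen(model=model_name, to_gpu=use_gpu)
```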
Train the model
If you want to train a model, you can refer to the official tutorial: download the Shakespeare text and use the following sample code:
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import GPT2ConfigCPU
from aitextgen import aitextgen
# The name of the downloaded Shakespeare text for training
file_name = "input.txt"
# Train a custom BPE Tokenizer on the downloaded text
# This will save one file: `aitextgen.tokenizer.json`, which contains the
# information needed to rebuild the tokenizer.
train_tokenizer(file_name)
tokenizer_file = "aitextgen.tokenizer.json"
# GPT2ConfigCPU is a mini variant of GPT-2 optimized for CPU-training
# e.g. the # of input tokens here is 64 vs. 1024 for base GPT-2.
config = GPT2ConfigCPU()
# Instantiate aitextgen using the created tokenizer and config
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)
# You can build datasets for training by creating TokenDatasets,
# which automatically processes the dataset with the appropriate size.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file, block_size=64)
# Train the model! It will save pytorch_model.bin periodically and after completion to the `trained_model` folder.
# On a 2020 8-core iMac, this took ~25 minutes to run.
ai.train(data, batch_size=8, num_steps=50000, generate_every=5000, save_every=5000)
# Generate text from it!
ai.generate(10, prompt="ROMEO:")
# With your trained model, you can reload the model at any time by
# providing the folder containing the pytorch_model.bin model weights + the config, and providing the tokenizer.
ai2 = aitextgen(model_folder="trained_model",
tokenizer_file="aitextgen.tokenizer.json")
ai2.generate(10, prompt="ROMEO:")
Save the model
Saving the model is very simple:
ai.save()
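A sketch of the full save/reload round trip, assuming ai is a trained instance as in the training example above; the calls are commented out since they require real model weights:

```python
# Files the save/reload cycle revolves around (per the training example above).
expected_files = ["pytorch_model.bin", "config.json", "aitextgen.tokenizer.json"]

# ai.save()  # writes the model weights and config
# Reload later by pointing at the folder holding those files:
# ai2 = aitextgen(model_folder="trained_model",
#                 tokenizer_file="aitextgen.tokenizer.json")
```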