Last Updated on 2024-07-20 by Clay
Introduction
In today's era of flourishing large language models, researchers and companies are racking their brains over how to apply these models to their work. Personally, though, I feel the capabilities of current language models are still not strong enough, and their practical application scenarios remain limited, often falling far short of what a human can do.
But there is one type of task for which large language models are naturally well suited: information extraction in arbitrary scenarios. That is what I want to introduce today with the NuExtract model.
Why single out information extraction? Because in traditional NLP, extracting information from text with models such as RNNs, LSTMs, or BERT (an encoder-only Transformer) is hard to generalize to arbitrary scenarios. In other words:
- If you want to extract information from a CT scan report, you need to fine-tune a CT scan report information extraction model.
- If you want to extract information from NBA news, you need to fine-tune an NBA news information extraction model.
- ... and so on.
Put simply, if you don't have data from a given domain at hand, it is very difficult for such a model to understand how to extract information from it. That, however, is a thing of the past.
Large Language Models (LLMs) are inherently capable of performing this task: during pre-training they have seen a vast amount of text, and their weights already encode an understanding of the world as represented in text.
So we can naturally use LLMs for 'information extraction', even if it takes some fine-tuning. As long as we train the model not to memorize but to extract information based on a 'literal' understanding of the input (it is crucial that the model reproduce spans exactly as they appear in the original text, otherwise the likelihood of hallucination increases), LLMs can extract different kinds of information across different domains with ease.
The NuExtract model I want to introduce today is such a model. While it is not perfect, it is at least a very, very good direction. I personally believe that better models will emerge in the future, and they will definitely be enhanced versions of the NuExtract I introduce today.
NuExtract is developed by a startup called NuMind and is open-sourced on HuggingFace. It is fine-tuned from Microsoft's Phi-3 model (NuExtract comes in several different scales, you can choose according to your needs). Let's take a look at how to use NuExtract and explore its potential.
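For reference, switching to a different size is just a matter of changing the checkpoint name passed to from_pretrained. The repository IDs and rough parameter counts below are the NuMind checkpoints I am aware of on HuggingFace at the time of writing; treat them as my own notes and double-check the model hub before relying on them.
# NuExtract checkpoints on HuggingFace (as of writing; please verify on the hub)
NUEXTRACT_CHECKPOINTS = {
    "tiny": "numind/NuExtract-tiny",    # ~0.5B parameters, fine on CPU
    "base": "numind/NuExtract",         # ~3.8B parameters, fine-tuned from Phi-3-mini
    "large": "numind/NuExtract-large",  # ~7B parameters, needs more GPU memory
}

# Pick the size that fits your hardware and accuracy needs
pretrained_model_name_or_path = NUEXTRACT_CHECKPOINTS["base"]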
Example Code
NuExtract is very easy to use. You can load it using the AutoModelForCausalLM class from the transformers library, and by defining a text and a schema, you can generate answers according to the template defined during training.
Here, the text I selected is one of the articles from HuggingFace's Daily Papers today, and the schema is set with the items that need to be extracted, such as date, author, and title.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda:0"
pretrained_model_name_or_path = "numind/NuExtract"

# Load the model in bfloat16 and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=True)
text = """LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
Published on Jul 8
·
Submitted by
FeYuan
on Jul 9
#2 Paper of the day
Authors:
Yinquan Lu
,
Wenhao Zhu
,
Lei Li
,
Yu Qiao
,
Fei Yuan
Abstract
Large Language Models~(LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we dedicate 35,000 A100-SXM4-80GB GPU hours in conducting extensive multilingual continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs~(by more than 10 spBLEU points) and performs on-par with specialized translation model~(M2M-100-12B) on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model."""
schema = """{
"Date": "",
"Author": [],
"Title": "",
"Categories": "",
}"""
input_llm = """<|input|>
### Template:
{schema}
### Text:
{text}
<|output|>
"""
# Fill the prompt template, tokenize, generate, and keep only the part after <|output|>
input_ids = tokenizer(input_llm.format(schema=schema, text=text), return_tensors="pt", truncation=True, max_length=4000).to(device)
print(tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True).split("<|output|>\n")[-1])
Output:
{
    "Date": "Jul 8",
    "Author": [
        "Yinquan Lu",
        "Wenhao Zhu",
        "Lei Li",
        "Yu Qiao",
        "Fei Yuan"
    ],
    "Title": "Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages",
    "Categories": ""
}
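Since the result comes back as JSON text, in practice I would parse it into a Python dict before using it. The sketch below is my own helper (the name parse_extraction is not part of NuExtract): it loads the JSON and, echoing the earlier point about 'literal' extraction, warns when an extracted string does not appear verbatim in the source text, which is a cheap way to flag possible hallucinations. Note that fields constrained to fixed labels (like the category selection shown later) are expected to trigger the warning, since they are labels rather than literal spans.
import json

def parse_extraction(generated: str, source_text: str) -> dict:
    """Parse NuExtract's JSON output and warn about values not found verbatim in the source."""
    result = json.loads(generated)
    for key, value in result.items():
        # Treat single values and lists uniformly
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item and item not in source_text:
                print(f"Warning: field '{key}' value '{item}' is not a verbatim span of the source text")
    return result
Running parse_extraction on the decoded output above returns a plain dict you can use directly in downstream code.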
I did not use the largest model; it has only 3.8B parameters, but it has already performed quite well. Of course, we can also modify the schema to force it to select at least one category.
schema = """{
"Date": "",
"Author": [],
"Title": "",
"Categories": "" (Select from ["AI, "BIO", "FinTech"]),
}"""
input_llm = """<|input|>
### Template:
{schema}
### Text:
{text}
<|output|>
"""
input_ids = tokenizer(input_llm.format(schema=schema, text=text), return_tensors="pt", truncation=True, max_length=4000).to(device)
print(tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True).split("<|output|>\n")[-1])
Output:
{
    "Date": "Jul 8",
    "Author": [
        "Yinquan Lu",
        "Wenhao Zhu",
        "Lei Li",
        "Yu Qiao",
        "Fei Yuan"
    ],
    "Title": "Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages",
    "Categories": "AI"
}
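To make repeated runs easier, I would wrap the prompt construction and generation into a small helper function. This is just a sketch built from the code above; the name predict_nuextract and its default arguments are my own, not something provided by the NuExtract repository.
def predict_nuextract(model, tokenizer, text: str, schema: str, device: str = "cuda:0", max_length: int = 4000) -> str:
    """Fill in NuExtract's prompt template, run generation, and return the raw JSON string."""
    prompt = f"<|input|>\n### Template:\n{schema}\n### Text:\n{text}\n<|output|>\n"
    input_ids = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=max_length).to(device)
    output = tokenizer.decode(model.generate(**input_ids)[0], skip_special_tokens=True)
    return output.split("<|output|>\n")[-1]
Calling predict_nuextract(model, tokenizer, text, schema) reproduces the two runs above, and the returned string can then be handed to the parse_extraction sketch from earlier.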
The biggest advantage of using LLMs for information extraction is that they do not require much fine-tuning: they can extract the information you want directly from articles without being limited to specific domains.
Of course, there are still shortcomings with the current models, but they are already powerful tools that can be used today. This is one of the practical application scenarios of LLMs.