Last Updated on 2021-04-07 by Clay
NLTK is short for Natural Language Toolkit, a natural language processing package for Python.
NLTK can also process Chinese, but its support for Chinese is not as good as for English, so today's examples all use an English corpus.
You can also take a look at the official NLTK website: https://www.nltk.org
Trust me, it explains things better than I do!
First, let's go over the text preprocessing steps that NLTK handles:
- sentence segmentation
- word segmentation (tokenization)
- POS (part-of-speech) tagging
- lemmatization
- stop word removal
- NER (named entity recognition)
Of course, text preprocessing covers many more techniques than these; today we will just walk through a few simple examples.
Preparation
First, we need to install the NLTK package with the following command:
pip3 install nltk
Then, to use it in our program, we need to import the package at the top of our script.
import nltk
If you get any error message like this:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
you can download the missing resource with nltk.download("punkt") (in this example the missing resource happens to be punkt).
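By the way, it can be convenient to download everything this post needs up front. A minimal sketch (these are exactly the resources used later in the complete code):

import nltk

nltk.download("punkt")                       # sentence / word tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger
nltk.download("wordnet")                     # dictionary used by the lemmatizer
nltk.download("stopwords")                   # stop word lists
nltk.download("maxent_ne_chunker")           # named entity chunker
nltk.download("words")                       # word list used by the chunker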
How to use the NLTK package
Sentence Segmentation
Let's say we have a text to analyze:
I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)
We assign this text to the variable text.
text = """I went to Japan. (NOT I went to the Japan.) He played tennis with Ben. (NOT He played tennis with the Ben.) They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.) (Some words don't have an article. We don't usually use articles for countries, meals or people.)"""
Then we only need to call an NLTK function to split it into sentences:
sentences = nltk.sent_tokenize(text)
print(sentences)
Output:
['I went to Japan.', '(NOT I went to the Japan.)', 'He played tennis with Ben.', '(NOT He played tennis with the Ben.)', 'They had breakfast at 9 o’clock.', "(NOT They had a breakfast at 9 o'clock.)", "(Some words don't have an article.", "We don't usually use articles for countries, meals or people.)"]
Word Segmentation
For the steps that follow, word segmentation should be run on the sentence-segmented output from above:
tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]

for token in tokens:
    print(token)
Output:
['I', 'went', 'to', 'Japan', '.']
['(', 'NOT', 'I', 'went', 'to', 'the', 'Japan', '.', ')']
['He', 'played', 'tennis', 'with', 'Ben', '.']
['(', 'NOT', 'He', 'played', 'tennis', 'with', 'the', 'Ben', '.', ')']
['They', 'had', 'breakfast', 'at', '9', 'o', '’', 'clock', '.']
['(', 'NOT', 'They', 'had', 'a', 'breakfast', 'at', '9', "o'clock", '.', ')']
['(', 'Some', 'words', 'do', "n't", 'have', 'an', 'article', '.']
['We', 'do', "n't", 'usually', 'use', 'articles', 'for', 'countries', ',', 'meals', 'or', 'people', '.', ')']
This completes the basic word segmentation. Note one small quirk in the output above: the curly apostrophe in "9 o’clock" was split into the three tokens 'o', '’' and 'clock', while the straight-apostrophe "o'clock" survived as a single token.
But we are not done yet; we still have to deal with lemmatization, stop words, and so on.
POS
Before proceeding with lemmatization and stop word removal, we should first do POS (part of speech) tagging. The order matters: if we lemmatized first, the POS tags would come out wrong, and to lemmatize a word properly we need to know its part of speech. (We will see this in the sample code.)
So let's work on POS tagging first.
pos = [nltk.pos_tag(token) for token in tokens]

for item in pos:
    print(item)
Output:
[('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('Japan', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('Japan', 'NNP'), ('.', '.'), (')', ')')]
[('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('Ben', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Ben', 'NNP'), ('.', '.'), (')', ')')]
[('They', 'PRP'), ('had', 'VBD'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ('o', 'JJ'), ('’', 'NN'), ('clock', 'NN'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('They', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ("o'clock", 'NN'), ('.', '.'), (')', ')')]
[('(', '('), ('Some', 'DT'), ('words', 'NNS'), ('do', 'VBP'), ("n't", 'RB'), ('have', 'VB'), ('an', 'DT'), ('article', 'NN'), ('.', '.')]
[('We', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('usually', 'RB'), ('use', 'VB'), ('articles', 'NNS'), ('for', 'IN'), ('countries', 'NNS'), (',', ','), ('meals', 'NNS'), ('or', 'CC'), ('people', 'NNS'), ('.', '.'), (')', ')')]
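If a tag such as VBD or NNP is unfamiliar, NLTK ships a small lookup helper that prints the Penn Treebank definition (this assumes the tagsets resource has been downloaded):

nltk.download("tagsets")

# Print the definition and examples for a Penn Treebank tag
nltk.help.upenn_tagset("VBD")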
Lemmatization
The lemmatization code is relatively long, and my coding may not be the most fluent, so please bear with me.
wordnet_pos = []
for p in pos:
    for word, tag in p:
        if tag.startswith('J'):
            wordnet_pos.append(nltk.corpus.wordnet.ADJ)
        elif tag.startswith('V'):
            wordnet_pos.append(nltk.corpus.wordnet.VERB)
        elif tag.startswith('N'):
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)
        elif tag.startswith('R'):
            wordnet_pos.append(nltk.corpus.wordnet.ADV)
        else:
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)

# Lemmatizer: walk the tokens in the same order the tags were collected
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = [word for p in pos for word, tag in p]
tokens = [lemmatizer.lemmatize(word, pos=wn_tag)
          for word, wn_tag in zip(words, wordnet_pos)]

for token in tokens:
    print(token)
Output:
I
go
to
Japan
.
(
NOT
I
go
to
the
Japan
.
)
He
play
tennis
with
Ben
.
(
NOT
He
play
tennis
with
the
Ben
.
)
They
have
breakfast
at
9
o
’
clock
.
(
NOT
They
have
a
breakfast
at
9
o'clock
.
)
(
Some
word
do
n't
have
an
article
.
We
do
n't
usually
use
article
for
country
,
meal
or
people
.
)
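By the way, the tag-mapping block above can be folded into a small helper function. This is just a tidier sketch of the same logic (the helper name penn_to_wordnet is my own, not an NLTK API):

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS constant
    if tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    return nltk.corpus.wordnet.NOUN  # nouns and everything else

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
          for p in pos for word, tag in p]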
Stopword
Stop word removal is straightforward: we import NLTK's built-in stop word list, then keep only the tokens that are not in it.
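If you are curious what the list actually contains, you can peek at it first (the exact size and contents depend on your NLTK data version):

nltk_stopwords = nltk.corpus.stopwords.words("english")

print(len(nltk_stopwords))  # size of the English stop word list
print(nltk_stopwords[:5])   # the first entries are lowercase pronouns such as 'i', 'me', 'my'

And the filtering itself: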
nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]

for token in tokens:
    print(token)
Output:
I
go
Japan
.
(
NOT
I
go
Japan
.
)
He
play
tennis
Ben
.
(
NOT
He
play
tennis
Ben
.
)
They
breakfast
9
’
clock
.
(
NOT
They
breakfast
9
o'clock
.
)
(
Some
word
n't
article
.
We
n't
usually
use
article
country
,
meal
people
.
)
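Notice that the capitalized words I, He, They and We survived the filter: NLTK's stop word list is all lowercase, and the comparison above is case-sensitive. If you want them removed too, a small variation is to compare in lowercase:

# Case-insensitive stop word filtering
tokens = [token for token in tokens if token.lower() not in nltk_stopwords]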
NER
Finally, we run nltk.ne_chunk() on the POS-tagged sentences to extract the named entities:
ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]

named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

named_entities = list(set(named_entities))
for ner in named_entities:
    print(ner)
Output:
('Ben', 'ORGANIZATION')
('Japan', 'GPE')
('Ben', 'PERSON')
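You may notice that "Ben" was tagged once as ORGANIZATION and once as PERSON; the chunker decides per sentence, so the same string can receive different labels. If you only care whether something is a named entity at all, nltk.ne_chunk() accepts a binary flag that collapses every entity type into a single NE label:

# Collapse all entity types into a single "NE" label
ne_chunked_sents = [nltk.ne_chunk(tag, binary=True) for tag in pos]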
Complete code
# coding: utf-8
import ssl
import nltk

# Use an unverified SSL context so nltk.download() works even behind
# strict certificate setups
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = """I went to Japan. (NOT I went to the Japan.) He played tennis with Ben. (NOT He played tennis with the Ben.) They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.) (Some words don't have an article. We don't usually use articles for countries, meals or people.)"""

# Sentence segmentation
sentences = nltk.sent_tokenize(text)

# Word segmentation
tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]

# POS tagging
pos = [nltk.pos_tag(token) for token in tokens]

# Lemmatization: map each Penn Treebank tag to a WordNet POS constant
wordnet_pos = []
for p in pos:
    for word, tag in p:
        if tag.startswith('J'):
            wordnet_pos.append(nltk.corpus.wordnet.ADJ)
        elif tag.startswith('V'):
            wordnet_pos.append(nltk.corpus.wordnet.VERB)
        elif tag.startswith('N'):
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)
        elif tag.startswith('R'):
            wordnet_pos.append(nltk.corpus.wordnet.ADV)
        else:
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)

# Lemmatizer
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = [word for p in pos for word, tag in p]
tokens = [lemmatizer.lemmatize(word, pos=wn_tag)
          for word, wn_tag in zip(words, wordnet_pos)]

# Stop words
nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]

# NER
ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

named_entities = list(set(named_entities))