
[NLP][Python] How to use NLTK package to process NLP tasks

Last Updated on 2021-04-07 by Clay

NLTK stands for Natural Language Toolkit, a natural language processing package for Python.

Although it can also process Chinese, its support for Chinese is not as good as for English, so today's examples all use an English corpus.

The official NLTK website is also worth a look: https://www.nltk.org

Trust me, it explains things better than I do!

First, let's go over the text preprocessing tasks that NLTK handles:

  • sentence segmentation
  • word segmentation (tokenization)
  • POS (part-of-speech) tagging
  • lemmatization
  • stopword removal
  • NER (named entity recognition)

Of course, text preprocessing covers many more techniques and variations. Today, we simply walk through a few simple examples.


Preparation

First, we need to install the NLTK package with the following command:

pip3 install nltk

Then, to use it in our program, we import the package at the top of the file.

import nltk


If you get an error message like this:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

You can use nltk.download("punkt") to download the missing resource (in this example, punkt).
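For reference, these are all the NLTK resources that the examples below rely on. Downloading them once up front avoids this kind of error:

import nltk

nltk.download("punkt")                       # sentence tokenizer model
nltk.download("averaged_perceptron_tagger")  # POS tagger
nltk.download("wordnet")                     # dictionary used by the lemmatizer
nltk.download("stopwords")                   # stopword lists
nltk.download("maxent_ne_chunker")           # NER chunker
nltk.download("words")                       # word list used by the chunker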


How to use NLTK package


Sentence Segmentation

Let's say we have a text to analyze:

I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)

We assign this text to the variable text:

text = """I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)"""


To split the text into sentences, we only need to call NLTK's sent_tokenize() function (which uses the pre-trained punkt model):

sentences = nltk.sent_tokenize(text)


Output:

['I went to Japan.', '(NOT I went to the Japan.)', 'He played tennis with Ben.', '(NOT He played tennis with the Ben.)', 'They had breakfast at 9 o’clock.', "(NOT They had a breakfast at 9 o'clock.)", "(Some words don't have an article.", "We don't usually use articles for countries, meals or people.)"]


Word Segmentation

For the steps that follow, word segmentation should be run on the sentence-segmented output rather than on the raw text.

tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]
for token in tokens:
    print(token)


Output:

['I', 'went', 'to', 'Japan', '.']
['(', 'NOT', 'I', 'went', 'to', 'the', 'Japan', '.', ')']
['He', 'played', 'tennis', 'with', 'Ben', '.']
['(', 'NOT', 'He', 'played', 'tennis', 'with', 'the', 'Ben', '.', ')']
['They', 'had', 'breakfast', 'at', '9', 'o', '’', 'clock', '.']
['(', 'NOT', 'They', 'had', 'a', 'breakfast', 'at', '9', "o'clock", '.', ')']
['(', 'Some', 'words', 'do', "n't", 'have', 'an', 'article', '.']
['We', 'do', "n't", 'usually', 'use', 'articles', 'for', 'countries', ',', 'meals', 'or', 'people', '.', ')']

This completes the basic word segmentation. Notice that the curly apostrophe in "9 o’clock" was split into three tokens ('o', '’', 'clock'), while the straight-apostrophe "o'clock" stayed in one piece; normalizing punctuation beforehand can avoid this.
But we are not done yet; we still have to handle "lemmatization", "stopwords", and so on.


POS

Before proceeding with lemmatization and stopword removal, we should first do POS (part-of-speech) tagging. If we lemmatized first, the POS analysis would be thrown off; and to lemmatize well, we also need the part of speech of each word in the text. (We will see this in the sample code.)
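Here is a quick illustration of why the tag matters. WordNetLemmatizer treats every word as a noun by default, so an irregular verb like "went" passes through unchanged unless we tell it the part of speech:

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
print(lemmatizer.lemmatize("went"))           # went (treated as a noun)
print(lemmatizer.lemmatize("went", pos="v"))  # go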

So let's work on POS tagging first.

pos = [nltk.pos_tag(token) for token in tokens]
for item in pos:
    print(item)


Output:

[('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('Japan', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('Japan', 'NNP'), ('.', '.'), (')', ')')]
[('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('Ben', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Ben', 'NNP'), ('.', '.'), (')', ')')]
[('They', 'PRP'), ('had', 'VBD'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ('o', 'JJ'), ('’', 'NN'), ('clock', 'NN'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('They', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ("o'clock", 'NN'), ('.', '.'), (')', ')')]
[('(', '('), ('Some', 'DT'), ('words', 'NNS'), ('do', 'VBP'), ("n't", 'RB'), ('have', 'VB'), ('an', 'DT'), ('article', 'NN'), ('.', '.')]
[('We', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('usually', 'RB'), ('use', 'VB'), ('articles', 'NNS'), ('for', 'IN'), ('countries', 'NNS'), (',', ','), ('meals', 'NNS'), ('or', 'CC'), ('people', 'NNS'), ('.', '.'), (')', ')')]
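These are Penn Treebank tags: 'PRP' is a personal pronoun, 'VBD' a past-tense verb, and so on. If a tag is unfamiliar, NLTK can describe it for you:

# Print the description of a Penn Treebank tag
# (requires the "tagsets" resource: nltk.download("tagsets"))
nltk.help.upenn_tagset("VBD")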


Lemmatization

The lemmatization code is relatively long; my coding may not be the most fluent, so please bear with me.

# Convert each Penn Treebank tag to the corresponding WordNet POS tag
wordnet_pos = []
for p in pos:
    for word, tag in p:
        if tag.startswith('J'):
            wordnet_pos.append(nltk.corpus.wordnet.ADJ)
        elif tag.startswith('V'):
            wordnet_pos.append(nltk.corpus.wordnet.VERB)
        elif tag.startswith('N'):
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)
        elif tag.startswith('R'):
            wordnet_pos.append(nltk.corpus.wordnet.ADV)
        else:
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)

# Lemmatizer: wordnet_pos is one flat list over all sentences, so flatten
# the tokens the same way before pairing each token with its POS tag
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = [word for p in pos for word, tag in p]
tokens = [lemmatizer.lemmatize(word, pos=wp) for word, wp in zip(words, wordnet_pos)]

for token in tokens:
    print(token)


Output:

I
go
to
Japan
.
(
NOT
I
go
to
the
Japan
.
)
He
play
tennis
with
Ben
.
(
NOT
He
play
tennis
with
the
Ben
.
)
They
have
breakfast
at
9
o
’
clock
.
(
NOT
They
have
a
breakfast
at
9
o'clock
.
)
(
Some
word
do
n't
have
an
article
.
We
do
n't
usually
use
article
for
country
,
meal
or
people
.
)


Stopword

Stopwords are straightforward: we import NLTK's list of stop words, then keep only the tokens that are not in that list.
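If you are curious what the list contains, you can print a few entries first (the exact contents vary a little between NLTK versions):

print(nltk.corpus.stopwords.words("english")[:8])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']

And the filtering itself: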

nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]
for token in tokens:
    print(token)


Output:

I
go
Japan
.
(
NOT
I
go
Japan
.
)
He
play
tennis
Ben
.
(
NOT
He
play
tennis
Ben
.
)
They
breakfast
9
’
clock
.
(
NOT
They
breakfast
9
o'clock
.
)
(
Some
word
n't
article
.
We
n't
usually
use
article
country
,
meal
people
.
)
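
One thing to notice in this output: "I", "We", "He", and "They" survived the filter because the NLTK stopword list is all lowercase. If you want them filtered out as well, a small variation is to compare in lowercase:

# Lowercase each token before checking it against the stopword list
tokens = [token for token in tokens if token.lower() not in nltk_stopwords]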


NER
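
Finally, named entity recognition. nltk.ne_chunk() takes one POS-tagged sentence and returns a tree in which recognized entities are grouped into labeled subtrees, so we feed it the pos results from above and collect every subtree that has a label: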

ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]
named_entities = []

for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # A subtree with a label is a recognized named entity
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

# Remove duplicates (this also discards the original order)
named_entities = list(set(named_entities))

for ner in named_entities:
    print(ner)



Output:

('Ben', 'ORGANIZATION')
('Japan', 'GPE')
('Ben', 'PERSON')
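
Note that "Ben" shows up as both PERSON and ORGANIZATION: the chunker tagged it differently in "with Ben" and "with the Ben". The default NLTK chunker is convenient, but far from perfect.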

Complete code

# coding: utf-8
import ssl
import nltk

# Some environments fail to download NLTK data because of SSL certificate
# verification; this disables verification so the downloads can proceed
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("maxent_ne_chunker")
nltk.download("words")



text = """I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)"""


# Sentences
sentences = nltk.sent_tokenize(text)


# Tokenize
tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]


# POS
pos = [nltk.pos_tag(token) for token in tokens]


# Lemmatization
wordnet_pos = []
for p in pos:
    for word, tag in p:
        if tag.startswith('J'):
            wordnet_pos.append(nltk.corpus.wordnet.ADJ)
        elif tag.startswith('V'):
            wordnet_pos.append(nltk.corpus.wordnet.VERB)
        elif tag.startswith('N'):
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)
        elif tag.startswith('R'):
            wordnet_pos.append(nltk.corpus.wordnet.ADV)
        else:
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)

# Lemmatizer: wordnet_pos is one flat list over all sentences, so flatten
# the tokens the same way before pairing each token with its POS tag
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = [word for p in pos for word, tag in p]
tokens = [lemmatizer.lemmatize(word, pos=wp) for word, wp in zip(words, wordnet_pos)]


# Stopwords
nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]


# NER
ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]
named_entities = []

for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # A subtree with a label is a recognized named entity
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

# Remove duplicates (this also discards the original order)
named_entities = list(set(named_entities))


