
[NLP][Python] NLTK: A Classic Tool for English Natural Language Processing

Last Updated on 2021-03-28 by Clay

NLTK stands for "Natural Language Toolkit"; it is a classic Python package dedicated to natural language processing.

Although it can handle some Chinese processing as well, its support for Chinese is naturally not as good as for English, so all of today's examples will work with English text.

First, let's outline the steps NLTK covers for text preprocessing:

  • sentence segmentation
  • word segmentation (tokenization)
  • POS (part-of-speech tagging)
  • lemmatization
  • stopword removal
  • NER (named entity recognition)

Of course, text preprocessing is not limited to these steps, and a few word-analysis features are mixed in here as well. But this is basically the order in which I will walk through today's examples.


Preparation

First, install NLTK into your Python environment with the following command:

pip3 install nltk

Then open a file and import the NLTK package at the top before doing anything else:

import nltk


If an error such as the following pops up:

Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

then follow the error message to download and install the missing resource.

For example, if punkt is missing, use:

import nltk
nltk.download("punkt")


Run the program and the download should complete. Once a resource has been downloaded, you won't need to download it again on later runs.
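If you would rather not call the downloader on every run, a minimal sketch using nltk.data.find() (which raises LookupError when a resource is missing) looks like this:

import nltk

# Download "punkt" only if it is not already present locally
try:
    nltk.data.find("tokenizers/punkt")
except LookupError:
    nltk.download("punkt")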


Sentence Segmentation

To analyze a piece of text, the natural first step is sentence segmentation.

Suppose we have the following text to analyze:

I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)

We store it in the variable text:

text = """I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)"""


Calling a single NLTK function is enough to split it into sentences:

sentences = nltk.sent_tokenize(text)


Output:

['I went to Japan.', '(NOT I went to the Japan.)', 'He played tennis with Ben.', '(NOT He played tennis with the Ben.)', 'They had breakfast at 9 o’clock.', "(NOT They had a breakfast at 9 o'clock.)", "(Some words don't have an article.", "We don't usually use articles for countries, meals or people.)"]


Word Segmentation (Tokenization)

With the later steps in mind, we tokenize starting from the sentence-segmented output rather than from the raw text.

tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]
for token in tokens:
    print(token)


Output:

['I', 'went', 'to', 'Japan', '.']
['(', 'NOT', 'I', 'went', 'to', 'the', 'Japan', '.', ')']
['He', 'played', 'tennis', 'with', 'Ben', '.']
['(', 'NOT', 'He', 'played', 'tennis', 'with', 'the', 'Ben', '.', ')']
['They', 'had', 'breakfast', 'at', '9', 'o', '’', 'clock', '.']
['(', 'NOT', 'They', 'had', 'a', 'breakfast', 'at', '9', "o'clock", '.', ')']
['(', 'Some', 'words', 'do', "n't", 'have', 'an', 'article', '.']
['We', 'do', "n't", 'usually', 'use', 'articles', 'for', 'countries', ',', 'meals', 'or', 'people', '.', ')']

That completes basic tokenization. But we are not done yet: lemmatization, stopword removal, and so on still lie ahead.
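As an aside, if you have no need to keep sentence boundaries, nltk.word_tokenize() also works directly on the raw text in a single call:

flat_tokens = nltk.word_tokenize(text)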


POS (Part-of-Speech Tagging)

Before lemmatization and stopword removal, however, we should first do POS (part-of-speech) tagging.

If we lemmatized first, the POS analysis could easily go wrong; moreover, good lemmatization itself needs each word's part of speech in context. (We will see this in the code below.)
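For example, WordNet's lemmatizer (which we will use below) treats every word as a noun unless told otherwise, so without a POS tag an irregular verb like "went" comes back unchanged. A quick standalone check:

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
print(lemmatizer.lemmatize("went"))           # went (assumed to be a noun)
print(lemmatizer.lemmatize("went", pos="v"))  # go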

So we run POS tagging first.

pos = [nltk.pos_tag(token) for token in tokens]
for item in pos:
    print(item)


Output:

[('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('Japan', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('Japan', 'NNP'), ('.', '.'), (')', ')')]
[('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('Ben', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Ben', 'NNP'), ('.', '.'), (')', ')')]
[('They', 'PRP'), ('had', 'VBD'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ('o', 'JJ'), ('’', 'NN'), ('clock', 'NN'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('They', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ("o'clock", 'NN'), ('.', '.'), (')', ')')]
[('(', '('), ('Some', 'DT'), ('words', 'NNS'), ('do', 'VBP'), ("n't", 'RB'), ('have', 'VB'), ('an', 'DT'), ('article', 'NN'), ('.', '.')]
[('We', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('usually', 'RB'), ('use', 'VB'), ('articles', 'NNS'), ('for', 'IN'), ('countries', 'NNS'), (',', ','), ('meals', 'NNS'), ('or', 'CC'), ('people', 'NNS'), ('.', '.'), (')', ')')]
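If any of these Penn Treebank tags are unfamiliar, NLTK has a built-in lookup for the tagset (you may need to run nltk.download("tagsets") first):

nltk.help.upenn_tagset("VBD")    # prints the tag's definition plus example words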


Lemmatization

The lemmatization code is a bit longer (perhaps I didn't write it very concisely, so please bear with me). WordNet's lemmatizer expects its own POS categories, so we first map each Penn Treebank tag to the corresponding WordNet tag:

# Map each Penn Treebank tag (flattened across all sentences) to a WordNet POS tag
wordnet_pos = []
for p in pos:
    for word, tag in p:
        if tag.startswith('J'):
            wordnet_pos.append(nltk.corpus.wordnet.ADJ)
        elif tag.startswith('V'):
            wordnet_pos.append(nltk.corpus.wordnet.VERB)
        elif tag.startswith('R'):
            wordnet_pos.append(nltk.corpus.wordnet.ADV)
        else:
            # Nouns and everything else default to NOUN
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)

# Lemmatize each token together with its own WordNet POS tag
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = [word for p in pos for word, tag in p]
tokens = [lemmatizer.lemmatize(word, pos=wn_tag) for word, wn_tag in zip(words, wordnet_pos)]

for token in tokens:
    print(token)


Output:

I
go
to
Japan
.
(
NOT
I
go
to
the
Japan
.
)
He
play
tennis
with
Ben
.
(
NOT
He
play
tennis
with
the
Ben
.
)
They
have
breakfast
at
9
o
’
clock
.
(
NOT
They
have
a
breakfast
at
9
o'clock
.
)
(
Some
word
do
n't
have
an
article
.
We
do
n't
usually
use
article
for
country
,
meal
or
people
.
)


Stopwords

Stopword removal is quite straightforward: load NLTK's stopword list, then use a loop (or list comprehension) to keep only the tokens that are not in that list.

nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]
for token in tokens:
    print(token)


Output:

I
go
Japan
.
(
NOT
I
go
Japan
.
)
He
play
tennis
Ben
.
(
NOT
He
play
tennis
Ben
.
)
They
breakfast
9
’
clock
.
(
NOT
They
breakfast
9
o'clock
.
)
(
Some
word
n't
article
.
We
n't
usually
use
article
country
,
meal
people
.
)
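Note that the comparison above is case-sensitive: NLTK's English stopword list is all lowercase, which is why capitalized tokens such as "I", "He", and "They" survive the filter. If that is not what you want, a small variant that lowercases each token before the lookup (and uses a set for faster membership tests) would be:

nltk_stopwords = set(nltk.corpus.stopwords.words("english"))
tokens = [token for token in tokens if token.lower() not in nltk_stopwords]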


NER (Named Entity Recognition)

Finally, named entity recognition builds on the POS-tagged output: nltk.ne_chunk() groups the tagged tokens into labeled subtrees, from which we collect (name, type) pairs.

ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]
named_entities = []

for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # Entity chunks come back as subtrees; ordinary tokens are plain (word, tag) tuples
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

# Deduplicate once, after collecting everything
named_entities = list(set(named_entities))

for ner in named_entities:
    print(ner)


Output:

('Ben', 'ORGANIZATION')
('Japan', 'GPE')
('Ben', 'PERSON')
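Notice that the chunker labels "Ben" inconsistently (once as ORGANIZATION, once as PERSON); the bundled model is far from perfect. If you only care whether a span is a named entity at all, nltk.ne_chunk() also accepts binary=True, which collapses all entity types into a single NE label:

ne_chunked_sents = [nltk.ne_chunk(tag, binary=True) for tag in pos]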



Complete Code

# coding: utf-8
import ssl
import nltk

# Work around SSL certificate errors when downloading NLTK resources
ssl._create_default_https_context = ssl._create_unverified_context

# Download every resource used below
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("maxent_ne_chunker")
nltk.download("words")



text = """I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)"""


# Sentences
sentences = nltk.sent_tokenize(text)


# Tokenize
tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]


# POS
pos = [nltk.pos_tag(token) for token in tokens]


# Lemmatization
# Map each Penn Treebank tag (flattened across all sentences) to a WordNet POS tag
wordnet_pos = []
for p in pos:
    for word, tag in p:
        if tag.startswith('J'):
            wordnet_pos.append(nltk.corpus.wordnet.ADJ)
        elif tag.startswith('V'):
            wordnet_pos.append(nltk.corpus.wordnet.VERB)
        elif tag.startswith('R'):
            wordnet_pos.append(nltk.corpus.wordnet.ADV)
        else:
            # Nouns and everything else default to NOUN
            wordnet_pos.append(nltk.corpus.wordnet.NOUN)

# Lemmatize each token together with its own WordNet POS tag
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
words = [word for p in pos for word, tag in p]
tokens = [lemmatizer.lemmatize(word, pos=wn_tag) for word, wn_tag in zip(words, wordnet_pos)]


# Stopwords
nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]


# NER
ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]
named_entities = []

for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        # Entity chunks come back as subtrees; ordinary tokens are plain (word, tag) tuples
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

# Deduplicate once, after collecting everything
named_entities = list(set(named_entities))



Postscript

Natural language processing (NLP) is a broad and deep field, and many excellent tools have been created for its various research and analysis tasks; Stanford CoreNLP, NLTK, and SnowNLP are all very well-known and useful examples.

Among the many resources available online, we should test how different tools perform on the specific task at hand. After all, since these are different tools, it is unlikely that a single one will turn out to be the best fit for every job.

So it pays to experiment widely.

