Last Updated on 2021-03-28 by Clay
NLTK, short for "Natural Language Toolkit", is a classic Python package dedicated to natural language processing.
It can handle some Chinese text as well, but its support for Chinese is naturally not as good as its support for English, so all of today's examples work with English text.
Let's first walk through the text-preprocessing steps that NLTK covers:
- sentence segmentation
- word segmentation (tokenization)
- POS tagging
- lemmatization
- stopword removal
- named entity recognition (NER)
Of course, real text preprocessing is not limited to these steps, and a couple of word-analysis features are mixed in here as well. Still, this is the order in which I will demonstrate the features today.
Preparation

First, install NLTK into your Python environment with the following command:
pip3 install nltk
Then open a file and import NLTK at the very top, before doing any of the work that follows:
import nltk
If an error message such as:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
pops up, download the missing resource as the message instructs.
For example, if punkt is missing, run:
import nltk
nltk.download("punkt")
and execute the script; the download should complete. Once a resource has been downloaded, you will not need to download it again every time you use NLTK.
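This tutorial ends up using several resources, so if you prefer, you can grab them all up front. A small convenience sketch, listing the same resources the complete script at the end of this article downloads:

import nltk

# Download every resource used in this tutorial (only needed once)
for resource in ["punkt", "averaged_perceptron_tagger", "wordnet",
                 "stopwords", "maxent_ne_chunker", "words"]:
    nltk.download(resource)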
Sentence Segmentation

Analyzing a text naturally begins with sentence segmentation.
Suppose we have the following text to analyze:
I went to Japan. (NOT I went to the Japan.)
He played tennis with Ben. (NOT He played tennis with the Ben.)
They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.)
(Some words don't have an article. We don't usually use articles for countries, meals or people.)
We store it in the variable text:
text = """I went to Japan. (NOT I went to the Japan.) He played tennis with Ben. (NOT He played tennis with the Ben.) They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.) (Some words don't have an article. We don't usually use articles for countries, meals or people.)"""
Calling a single NLTK function is all it takes to segment the sentences:

sentences = nltk.sent_tokenize(text)
print(sentences)
Output:
['I went to Japan.', '(NOT I went to the Japan.)', 'He played tennis with Ben.', '(NOT He played tennis with the Ben.)', 'They had breakfast at 9 o’clock.', "(NOT They had a breakfast at 9 o'clock.)", "(Some words don't have an article.", "We don't usually use articles for countries, meals or people.)"]
Word Segmentation

With the steps that follow in mind, we tokenize words starting from the output of the sentence segmentation above.

tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]

for token in tokens:
    print(token)
Output:
['I', 'went', 'to', 'Japan', '.']
['(', 'NOT', 'I', 'went', 'to', 'the', 'Japan', '.', ')']
['He', 'played', 'tennis', 'with', 'Ben', '.']
['(', 'NOT', 'He', 'played', 'tennis', 'with', 'the', 'Ben', '.', ')']
['They', 'had', 'breakfast', 'at', '9', 'o', '’', 'clock', '.']
['(', 'NOT', 'They', 'had', 'a', 'breakfast', 'at', '9', "o'clock", '.', ')']
['(', 'Some', 'words', 'do', "n't", 'have', 'an', 'article', '.']
['We', 'do', "n't", 'usually', 'use', 'articles', 'for', 'countries', ',', 'meals', 'or', 'people', '.', ')']
With that, basic tokenization is done. But we are not finished yet: lemmatization, stopword removal, and so on still lie ahead.
POS Tagging

Before moving on to lemmatization and stopword removal, though, we should run POS tagging first.
If we tagged parts of speech after lemmatizing, the results could easily be wrong; moreover, good lemmatization itself needs each word's part of speech in context (you will see this in the code in a moment).
So we do the POS tagging first:

pos = [nltk.pos_tag(token) for token in tokens]

for item in pos:
    print(item)
Output:
[('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('Japan', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('I', 'PRP'), ('went', 'VBD'), ('to', 'TO'), ('the', 'DT'), ('Japan', 'NNP'), ('.', '.'), (')', ')')]
[('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('Ben', 'NNP'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('He', 'PRP'), ('played', 'VBD'), ('tennis', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Ben', 'NNP'), ('.', '.'), (')', ')')]
[('They', 'PRP'), ('had', 'VBD'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ('o', 'JJ'), ('’', 'NN'), ('clock', 'NN'), ('.', '.')]
[('(', '('), ('NOT', 'NNP'), ('They', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('breakfast', 'NN'), ('at', 'IN'), ('9', 'CD'), ("o'clock", 'NN'), ('.', '.'), (')', ')')]
[('(', '('), ('Some', 'DT'), ('words', 'NNS'), ('do', 'VBP'), ("n't", 'RB'), ('have', 'VB'), ('an', 'DT'), ('article', 'NN'), ('.', '.')]
[('We', 'PRP'), ('do', 'VBP'), ("n't", 'RB'), ('usually', 'RB'), ('use', 'VB'), ('articles', 'NNS'), ('for', 'IN'), ('countries', 'NNS'), (',', ','), ('meals', 'NNS'), ('or', 'CC'), ('people', 'NNS'), ('.', '.'), (')', ')')]
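If you are unsure what a Penn Treebank tag such as VBD or NNP stands for, NLTK ships a small lookup helper; a quick sketch (it may first ask you to run nltk.download("tagsets")):

# Print the definition and examples of a Penn Treebank tag
nltk.help.upenn_tagset("VBD")
nltk.help.upenn_tagset("NNP")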
Lemmatization

The lemmatization code is a little longer; perhaps I have not written it very concisely, so please bear with me. The point of the mapping below is that pos_tag() returns Penn Treebank tags, while WordNetLemmatizer expects WordNet POS constants.

# Map Penn Treebank tags to the WordNet POS constants the lemmatizer expects
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN

# Lemmatizer
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
          for p in pos for word, tag in p]

for token in tokens:
    print(token)
Output:
I
go
to
Japan
.
(
NOT
I
go
to
the
Japan
.
)
He
play
tennis
with
Ben
.
(
NOT
He
play
tennis
with
the
Ben
.
)
They
have
breakfast
at
9
o
’
clock
.
(
NOT
They
have
a
breakfast
at
9
o'clock
.
)
(
Some
word
do
n't
have
an
article
.
We
do
n't
usually
use
article
for
country
,
meal
or
people
.
)
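This also shows why we bothered with POS tags at all: WordNetLemmatizer treats every word as a noun by default, so irregular verb forms pass through unchanged unless we say otherwise. A minimal check:

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()

print(lemmatizer.lemmatize("went"))           # 'went' (treated as a noun by default)
print(lemmatizer.lemmatize("went", pos="v"))  # 'go'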
Stopwords

Stopword removal is quite simple: import NLTK's stopword list, then keep only the tokens that do not appear in that list.

nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]

for token in tokens:
    print(token)
Output:
I
go
Japan
.
(
NOT
I
go
Japan
.
)
He
play
tennis
Ben
.
(
NOT
He
play
tennis
Ben
.
)
They
breakfast
9
’
clock
.
(
NOT
They
breakfast
9
o'clock
.
)
(
Some
word
n't
article
.
We
n't
usually
use
article
country
,
meal
people
.
)
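Notice that capitalized tokens such as "I", "We", "They", and "NOT" survive the filter: NLTK's stopword list is all lowercase, and the comparison above is case-sensitive. If you want those removed as well, one common variation is to lowercase each token just for the check:

# Case-insensitive stopword filtering (keeps the original token casing)
tokens = [token for token in tokens if token.lower() not in nltk_stopwords]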
NER (Named Entity Recognition)

nltk.ne_chunk() takes a POS-tagged sentence and returns a tree in which each recognized entity is a labeled subtree, so we collect every subtree that carries a label:

ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]

named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

named_entities = list(set(named_entities))

for ner in named_entities:
    print(ner)
Output:
('Ben', 'ORGANIZATION')
('Japan', 'GPE')
('Ben', 'PERSON')
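Note that "Ben" is reported as both ORGANIZATION and PERSON: ne_chunk() labels each sentence independently, so the same name can receive different labels in different sentences. The chunker returns nltk.Tree objects, and printing one shows how the entities are nested; for the first sentence the tree should look roughly like this:

print(ne_chunked_sents[0])
# (S I/PRP went/VBD to/TO (GPE Japan/NNP) ./.)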
Complete Code

# coding: utf-8
import ssl

import nltk

# Use an unverified SSL context so nltk.download() works behind strict certificates
ssl._create_default_https_context = ssl._create_unverified_context

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("maxent_ne_chunker")
nltk.download("words")

text = """I went to Japan. (NOT I went to the Japan.) He played tennis with Ben. (NOT He played tennis with the Ben.) They had breakfast at 9 o’clock. (NOT They had a breakfast at 9 o'clock.) (Some words don't have an article. We don't usually use articles for countries, meals or people.)"""

# Sentences
sentences = nltk.sent_tokenize(text)

# Tokenize
tokens = [nltk.tokenize.word_tokenize(sent) for sent in sentences]

# POS
pos = [nltk.pos_tag(token) for token in tokens]

# Lemmatization: map Penn Treebank tags to WordNet POS constants
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return nltk.corpus.wordnet.ADJ
    elif tag.startswith('V'):
        return nltk.corpus.wordnet.VERB
    elif tag.startswith('N'):
        return nltk.corpus.wordnet.NOUN
    elif tag.startswith('R'):
        return nltk.corpus.wordnet.ADV
    else:
        return nltk.corpus.wordnet.NOUN

# Lemmatizer
lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag))
          for p in pos for word, tag in p]

# Stopwords
nltk_stopwords = nltk.corpus.stopwords.words("english")
tokens = [token for token in tokens if token not in nltk_stopwords]

# NER
ne_chunked_sents = [nltk.ne_chunk(tag) for tag in pos]
named_entities = []
for ne_tagged_sentence in ne_chunked_sents:
    for tagged_tree in ne_tagged_sentence:
        if hasattr(tagged_tree, 'label'):
            entity_name = ' '.join(c[0] for c in tagged_tree.leaves())
            entity_type = tagged_tree.label()
            named_entities.append((entity_name, entity_type))

named_entities = list(set(named_entities))
Postscript
"Natural language processing" (NLP) is a broad and deep field, and a great many handy tools have been created to support all kinds of research and analysis; Stanford CoreNLP, NLTK, SnowNLP ...... are all well-known and useful.
Among the many resources available online, we should test how different tools perform on the specific task at hand. After all, since the tools are different, it is unlikely that one single tool will turn out to be the best fit for every job.
So experimenting widely is well worth it.