
[NLP][Python] How to use CKIP to analyze Traditional Chinese

If you want to use a Python NLP toolkit to analyze Traditional Chinese text, CKIP is your first choice. CKIP is developed by the Institute of Information Science, Academia Sinica, in Taiwan, and has ranked highly in many competitions.

Until recently, it was not open source. If you wanted to use it, you had to go to the online demo website (https://ckip.iis.sinica.edu.tw/demo/) or apply for authorization every month to use the limited download version (a .bat file, or calling it from Python).

Fortunately, not long ago (2019/09/04), CKIP finally released the source code on GitHub: https://github.com/ckiplab/ckiptagger

So today I will introduce how to use it.


Preparation

My environment:

  • Python: >= 3.6
  • Tensorflow: >= 1.13.1 and < 2
  • gdown: the latest version

We use the following commands to install ckiptagger; note that we pin tensorflow below version 2, matching the requirement above. gdown is a package for downloading the model from Google Drive.

pip3 install ckiptagger
pip3 install "tensorflow>=1.13.1,<2"
pip3 install gdown

Once that is done, open a .py file and write:

# -*- coding: utf-8 -*-
from ckiptagger import data_utils

# Download the pre-trained model data into the current directory
data_utils.download_data_gdown("./")


Execute it and the program will download the model data we need into the current path. After the download finishes, we can delete this download command.
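Alternatively, if you prefer to keep the script intact, here is a minimal sketch (assuming the data is extracted to ./data, the directory used later in this article) that skips the download when the data already exists:

# -*- coding: utf-8 -*-
import os
from ckiptagger import data_utils

# Only download the model data if it has not been extracted yet
# (assumption: the archive is extracted to ./data, as used below)
if not os.path.isdir("./data"):
    data_utils.download_data_gdown("./")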

Next, we practice how to use ckiptagger to get word segmentation, part-of-speech (POS) tags, and named entity recognition (NER).


Segmentation & POS & NER

Forgive me for covering these three different NLP tasks together: the program is really short, and CKIP's NER requires the POS results as input. This is very similar to NLTK.

# -*- coding: utf-8 -*-
from ckiptagger import WS, POS, NER

text = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'

# Load the models from the downloaded data directory
ws = WS("./data")
pos = POS("./data")
ner = NER("./data")

# Segmentation takes a list of sentences; POS tagging takes the
# segmentation results; NER takes both the segmentation and POS results
ws_results = ws([text])
pos_results = pos(ws_results)
ner_results = ner(ws_results, pos_results)

print(ws_results)
print(pos_results)
for name in ner_results[0]:
    print(name)


Output:

[['傅達仁', '今', '將', '執行', '安樂死', ',', '卻', '突然', '爆出', '自己', '20', '年', '前', '遭', '緯來', '體育台', '封殺', ',', '他', '不', '懂', '自己', '哪裡', '得罪到', '電視台', '。']]
[['Nb', 'Nd', 'D', 'VC', 'Na', 'COMMACATEGORY', 'D', 'D', 'VJ', 'Nh', 'Neu', 'Nf', 'Ng', 'P', 'Nb', 'Na', 'VC', 'COMMACATEGORY', 'Nh', 'D', 'VK', 'Nh', 'Ncd', 'VJ', 'Nc', 'PERIODCATEGORY']]
(0, 3, 'PERSON', '傅達仁')
(18, 22, 'DATE', '20年前')

First, we import WS, POS, and NER, and point each model to the data we downloaded.

By the way, the text I analyzed is the sample sentence provided in the GitHub README.

ws = WS("./data")
pos = POS("./data")
ner = NER("./data")


Then we can use the following code to analyze the text.

ws_results = ws([text])
pos_results = pos(ws_results)
ner_results = ner(ws_results, pos_results)

print(ws_results)
print(pos_results)
for name in ner_results[0]:
    print(name)


Output:

[['傅達仁', '今', '將', '執行', '安樂死', ',', '卻', '突然', '爆出', '自己', '20', '年', '前', '遭', '緯來', '體育台', '封殺', ',', '他', '不', '懂', '自己', '哪裡', '得罪到', '電視台', '。']]
[['Nb', 'Nd', 'D', 'VC', 'Na', 'COMMACATEGORY', 'D', 'D', 'VJ', 'Nh', 'Neu', 'Nf', 'Ng', 'P', 'Nb', 'Na', 'VC', 'COMMACATEGORY', 'Nh', 'D', 'VK', 'Nh', 'Ncd', 'VJ', 'Nc', 'PERIODCATEGORY']]
(0, 3, 'PERSON', '傅達仁')
(18, 22, 'DATE', '20年前')
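As the output shows, each NER result is a tuple of the form (start, end, entity_type, entity_text), where start and end are character offsets into the original string, so you can recover each entity by slicing. A minimal sketch of reading the results:

# Each entity is (start, end, entity_type, entity_text);
# start and end are character offsets into the original text
for start, end, entity_type, entity_text in ner_results[0]:
    assert text[start:end] == entity_text  # the span indexes the original string
    print(f"{entity_type}: {entity_text} (chars {start}-{end})")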

For the part-of-speech tag set, you can refer to the official documentation: http://ckipsvr.iis.sinica.edu.tw/cat.htm
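If you prefer to read the result as word/tag pairs instead of two parallel lists, you can zip the segmentation and POS outputs together, since they are aligned one-to-one. A minimal sketch:

# ws_results[0] and pos_results[0] are parallel lists: one tag per word
for word, tag in zip(ws_results[0], pos_results[0]):
    print(f"{word}({tag})", end=" ")
print()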

