
[NLP][Python] Chinese natural language analysis tools: THULAC

Last Updated on 2021-04-03 by Clay

I haven’t used this tool in a long time. I recently needed it again, so I dug up the code I had studied before, tried out some functions I hadn’t tested back then, and wrote everything down as a note along the way.

It’s a pity that back then I didn’t have the habit of taking notes the way I do now.

I first used THULAC because my thesis advisor asked me to parse Chinese text with various tools and compare their results.

Frankly, the accuracy of THULAC really surprised me. It has always felt more accurate to me than Jieba (another Chinese analysis tool).

I have recorded several different Chinese analysis tools and I will put their links at the end of this article.

Let’s get back to our topic: THULAC.

If you want to try THULAC online, you can use their online demo: http://thulac.thunlp.org/demo

But the most convenient way is to call the Python package.


Preparation

First we need to install THULAC in our Python environment.

pip3 install thulac

Word segmentation & Part-Of-Speech

We only need to call the cut() function to get both the word segmentation and the part-of-speech tags.

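# -*- coding: utf-8 -*-
from opencc import OpenCC
import thulac

# Convert Traditional Chinese (Taiwan) to Simplified Chinese with mainland phrasing
cc = OpenCC('tw2sp')
text = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'
text = cc.convert(text)

# Load the default model, then segment the text and tag each word
thu = thulac.thulac()
print(thu.cut(text))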


Output:

Model loaded succeed
[['傅达仁', 'np'], ['今将', 'd'], ['运行', 'v'], ['安乐', 'ns'], ['死', 'v'], [',', 'w'], ['却', 'd'], ['突然', 'a'], ['爆出', 'v'], ['自己', 'r'], ['20', 'm'], ['年', 'q'], ['前', 'f'], ['遭', 'v'], ['纬', 'g'], ['来', 'v'], ['体育台', 'n'], ['封杀', 'v'], [',', 'w'], ['他', 'r'], ['不', 'd'], ['懂', 'v'], ['自己', 'r'], ['哪', 'r'], ['里', 'q'], ['得罪', 'v'], ['到', 'v'], ['电视台', 'n'], ['。', 'w']]

First I import the packages I need, then use the OpenCC package (installed separately from THULAC) to convert the Traditional Chinese text to Simplified Chinese, because THULAC handles Simplified Chinese better than Traditional Chinese.
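If you have many sentences to process, the convert-then-cut step can be wrapped in a small helper so that the converter and the model are created once and reused. The sketch below is only an illustration: `segment_all`, `fake_convert` and `fake_cut` are names I made up, and the two stand-ins exist only so the snippet runs without the model installed.

```python
def segment_all(sentences, convert, cut):
    """Convert each sentence (e.g. Traditional -> Simplified), then segment it.

    `convert` and `cut` are passed in by the caller, so the converter and
    the model are created once and reused for every sentence.
    """
    return [cut(convert(sentence)) for sentence in sentences]


# Stand-in callables so this sketch runs on its own.
# They are NOT real conversion or segmentation.
def fake_convert(text):
    return text.strip()


def fake_cut(text):
    return [[char, 'x'] for char in text]


print(segment_all([' 你好 '], fake_convert, fake_cut))
```

With the objects from the script above, the real call would be segment_all(texts, cc.convert, thu.cut), where texts is your own list of strings.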

The output is a list of [word, part-of-speech tag] pairs, one pair per segmented word.
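Since the result is an ordinary Python list, it is easy to post-process. Here is a minimal sketch that picks out words by tag and counts the tags. The sample is the first ten pairs copied from the output above; `words_with_tags` and `tag_counts` are helper names I made up, and I am assuming 'v' marks verbs, as the output suggests.

```python
from collections import Counter

# First ten pairs copied from the output above
tagged = [['傅达仁', 'np'], ['今将', 'd'], ['运行', 'v'], ['安乐', 'ns'],
          ['死', 'v'], [',', 'w'], ['却', 'd'], ['突然', 'a'],
          ['爆出', 'v'], ['自己', 'r']]


def words_with_tags(pairs, wanted):
    """Return the words whose tag is in `wanted`, keeping the original order."""
    return [word for word, tag in pairs if tag in wanted]


def tag_counts(pairs):
    """Count how many times each part-of-speech tag appears."""
    return Counter(tag for _, tag in pairs)


verbs = words_with_tags(tagged, {'v'})  # assumption: 'v' is the verb tag
print(verbs)                    # ['运行', '死', '爆出']
print(tag_counts(tagged)['v'])  # 3
```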

