Last Updated on 2021-03-29 by Clay
Segmenting a sentence into words is very important in Chinese. In English you can segment words by spaces, but in Chinese you cannot.
Let’s take an example.
Language | Sentence |
---|---|
English | Today is a nice day. |
Chinese | 今天是個好日子。 |
If you segment these sentences by spaces, you will get the following results.
English: Today | is | a | nice | day.
Chinese: 今 | 天 | 是 | 個 | 好 | 日 | 子
But the Chinese result is wrong. In Chinese, the translation of “Today” is 今天, and “day” is 日子.
Correct Chinese Result:
English | Chinese |
---|---|
Today | 今天 |
is | 是 |
a | 個 |
nice | 好 |
day | 日子 |
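To make the difference concrete, here is a minimal Python sketch. Splitting the English sentence on spaces yields words, but since Chinese has no spaces, the closest naive equivalent only yields single characters:

# -*- coding: utf-8 -*-
english = 'Today is a nice day.'
chinese = '今天是個好日子'

# English: spaces separate words, so split() works.
print(english.split())    # ['Today', 'is', 'a', 'nice', 'day.']

# Chinese: no spaces, so the best a naive approach can do is single characters.
print(list(chinese))      # ['今', '天', '是', '個', '好', '日', '子']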
So we need a tool to segment Chinese words correctly, and that is the package “Jieba” I will introduce today.
Jieba is an open-source project on GitHub; its advantages are that it is lightweight and very fast. If its results are wrong, you can also set a custom user dictionary for it.
In the GitHub README, its algorithm is described as follows:
- Based on a prefix dictionary structure to achieve efficient word graph scanning. Build a directed acyclic graph (DAG) for all possible word combinations.
- Use dynamic programming to find the most probable combination based on the word frequency.
- For unknown words, an HMM-based model is used with the Viterbi algorithm.
If you want to read more about the project, you can refer here.
How to use the Jieba package
You need to use the following command to install it.
pip3 install jieba
Or if you insist on not using Python, you can refer to one of these ports:
- Java: https://github.com/huaban/jieba-analysis
- C++: https://github.com/yanyiwu/cppjieba
- Rust: https://github.com/messense/jieba-rs
- Node.js: https://github.com/yanyiwu/nodejieba
- Erlang: https://github.com/falood/exjieba
- R: https://github.com/qinwf/jiebaR
- iOS: https://github.com/yanyiwu/iosjieba
- PHP: https://github.com/fukuball/jieba-php
- .Net (C#): https://github.com/anderscui/jieba.NET/
- Go: https://github.com/wangbin/jiebago or https://github.com/yanyiwu/gojieba
- Android: https://github.com/452896915/jieba-android
Segmentation
jieba.cut() is the function we need to use, and it receives three arguments:
- (str) the text we want to segment
- (bool) whether to activate cut_all mode
- (bool) whether to use the HMM model
We use an example from GitHub, but the text is Traditional Chinese (NOT Simplified Chinese).
Chinese: 我來到北京清華大學
English: I came to Beijing Tsinghua University
# -*- coding: utf-8 -*-
import jieba

text = '我來到北京清華大學'

print('Default (HMM):', '|'.join(jieba.cut(text, cut_all=False, HMM=True)))
print('All deactivate:', '|'.join(jieba.cut(text, cut_all=False, HMM=False)))
print('All activate:', '|'.join(jieba.cut(text, cut_all=True, HMM=True)))
Output:
Default (HMM): 我|來到|北京|清華|大學
All deactivate: 我|來到|北京|清華|大學
All activate: 我來|來到|北京|清華|華大|大學
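Notice that turning the HMM off changes nothing here, because every word in this sentence is already in the dictionary. The HMM only matters for unknown words. Below is a minimal sketch based on the new-word example in the jieba README (the sentence is Simplified Chinese, and your exact output may vary with your dictionary version):

# -*- coding: utf-8 -*-
import jieba

# 杭研 (an abbreviated company name) is not in the default dictionary.
text = '他来到了网易杭研大厦'  # "He came to the NetEase Hangyan Building."

print('HMM on: ', '|'.join(jieba.cut(text, HMM=True)))   # the Viterbi step can recover 杭研
print('HMM off:', '|'.join(jieba.cut(text, HMM=False)))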
jieba.cut() returns a generator object, so you can use join() as I did to insert any character between the segmented words. Or you can use jieba.lcut() to get a List object instead.
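For example, a minimal sketch of the list version:

import jieba

text = '我來到北京清華大學'
words = jieba.lcut(text)   # same segmentation, but returned as a plain list

print(words)   # e.g. ['我', '來到', '北京', '清華', '大學']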
User dictionary
Sometimes we find that Jieba's results are not very good, perhaps because there is too much proprietary vocabulary in our text. In this case, we can set a user dictionary to help.
Let’s take another example.
總統蔡英文論文風波延燒後,最新民調今日出爐!據親藍民調公布結果,蔡英文支持度45%,遙遙領先韓國瑜的33%,兩人差距擴大到12個百分點。顯示論文門風波,並未重創小英聲望。
(Roughly: "After the controversy over President Tsai Ing-wen's thesis flared up, the latest poll is out today! According to a pan-blue poll, Tsai's support is 45%, far ahead of Han Kuo-yu's 33%, widening the gap to 12 percentage points. This shows that the thesis-gate controversy has not seriously damaged Tsai's popularity.")
It's a short piece of Taiwanese news that I extracted from Google News today (2019/09/23). If we use Jieba to process it, we get the following result.
['總統', '蔡英文', '論文', '風波', '延燒', '後', ',', '最', '新', '民調', '今日', '出爐', '!', '據', '親藍', '民調', '公布', '結果', ',', '蔡英文', '支持度', '45%', ',', '遙遙領先', '韓國', '瑜', '的', '33%', ',', '兩人', '差距', '擴大到', '12', '個', '百分點', '。', '顯示', '論文', '門', '風波', ',', '並未', '重創', '小英', '聲望', '。']
Most of the segmentation is not bad, but there is one fatal failure: 韓國瑜 is cut into 韓國 and 瑜.
韓國 means “Korea” in Chinese, and 瑜 may be read as a kind of jade.
But 韓國瑜 is a man, guys. I am a big fan of him. (Not really)
So we can create a userDict.txt file and write down the word we want Jieba to recognize, along with its frequency:
韓國瑜 3000
And go back to the beginning of the code and add:
jieba.load_userdict('userDict.txt')
And use the jieba.lcut() function:
print(jieba.lcut(text))
Output:
['總統', '蔡英文', '論文', '風波', '延燒', '後', ',', '最', '新', '民調', '今日', '出爐', '!', '據', '親藍', '民調', '公布', '結果', ',', '蔡英文', '支持度', '45%', ',', '遙遙領先', '韓國瑜', '的', '33%', ',', '兩人', '差距', '擴大到', '12', '個', '百分點', '。', '顯示', '論文', '門', '風波', ',', '並未', '重創', '小英', '聲望', '。']
This time we succeeded.
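Putting the pieces together, a minimal sketch of the whole user-dictionary flow might look like this (userDict.txt is the file we just created; jieba.add_word() is an alternative if you prefer to register single words in code):

# -*- coding: utf-8 -*-
import jieba

# Each line of the user dictionary is: word [frequency] [POS tag]
jieba.load_userdict('userDict.txt')
# jieba.add_word('韓國瑜', freq=3000)   # alternative to the dictionary file

text = '總統蔡英文論文風波延燒後,最新民調今日出爐!據親藍民調公布結果,蔡英文支持度45%,遙遙領先韓國瑜的33%,兩人差距擴大到12個百分點。顯示論文門風波,並未重創小英聲望。'

print(jieba.lcut(text))   # 韓國瑜 now stays in one piece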
Parts Of Speech (POS)
If we want to use the POS function in Jieba, we need to import the pseg module.
# -*- coding: utf-8 -*-
import jieba
import jieba.posseg as pseg

text = '我來到北京清華大學'
words = pseg.cut(text)

for word, flag in words:
    print(word, flag)
Output:
我 r
來到 x
北京 ns
清華 x
大學 x
If you want to know what each part-of-speech tag means, please refer to: https://www.cnblogs.com/chenbjin/p/4341930.html
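As a small extra, here is a hedged sketch of how you might filter words by their POS tags, for example keeping only tags that start with “n” (nouns and place names in the jieba tag set):

import jieba.posseg as pseg

text = '我來到北京清華大學'
nouns = [word for word, flag in pseg.cut(text) if flag.startswith('n')]

print(nouns)   # e.g. ['北京'] - 清華 and 大學 were tagged x (unknown) above, so they are skipped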
Keyword extraction
According to the description in the README, the keyword extraction method is based on TF-IDF, and we need to import jieba.analyse.
# -*- coding: utf-8 -*-
import jieba
import jieba.analyse

text = '總統蔡英文論文風波延燒後,最新民調今日出爐!據親藍民調公布結果,蔡英文支持度45%,遙遙領先韓國瑜的33%,兩人差距擴大到12個百分點。顯示論文門風波,並未重創小英聲望。'

tags = jieba.analyse.extract_tags(text, topK=5)
print(tags)
Output:
['蔡英文', '論文', '風波', '民調', '小英']
The argument topK is the number of keywords to return; the default is 20, but I set it to 5 here.
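If you also want the TF-IDF scores themselves, extract_tags() accepts a withWeight argument; a minimal sketch:

import jieba
import jieba.analyse

text = '總統蔡英文論文風波延燒後,最新民調今日出爐!據親藍民調公布結果,蔡英文支持度45%,遙遙領先韓國瑜的33%,兩人差距擴大到12個百分點。顯示論文門風波,並未重創小英聲望。'

# With withWeight=True, each item is a (keyword, TF-IDF weight) tuple.
for tag, weight in jieba.analyse.extract_tags(text, topK=5, withWeight=True):
    print(tag, weight)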