Last Updated on 2021-04-01 by Clay
HanLP (Han Language Processing) is an open-source project on GitHub that provides many functions:
- Segmentation
- Part-of-speech tagging
- Named entity recognition
- Keyword extraction
- Text summarization
- Traditional to Simplified Chinese conversion
- Text recommendation
- Text classification
- Word2Vec
- ...
If you want to read more of its documentation, you can refer to: https://github.com/hankcs/HanLP
Or, if you want to try a demo: http://hanlp.com/
Preparation
pip3 install opencc-python-reimplemented
pip3 install pyhanlp
OpenCC is the tool I use to convert Traditional Chinese to Simplified Chinese, because HanLP gives better results on Simplified Chinese than on Traditional Chinese.
In fact, pyhanlp is a Python API that calls HanLP, which is written in pure Java, so you also need Java installed. It needs to download the model data the first time you use it.
Segmentation & POS
These two functions are used together in pyhanlp. Let's take a look at a short sample:
# -*- coding: utf-8 -*-
from opencc import OpenCC
from pyhanlp import *

# Convert Traditional Chinese (Taiwan) to Simplified Chinese
cc = OpenCC('tw2sp')
text = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'
text = cc.convert(text)

# Tokenize
print(HanLP.segment(text))

# Print each word with its part-of-speech tag
for term in HanLP.segment(text):
    print(term.word, term.nature)
Output:
[傅/nz, 达仁/ns, 今/tg, 将/d, 运行/vn, 安乐死/v, ,/w, 却/d, 突然/ad, 爆出/v, 自己/rr, 20/m, 年前/t, 遭/v, 纬/ng, 来/vf, 体育台/nz, 封杀/v, ,/w, 他/rr, 不懂/v, 自己/rr, 哪里/rys, 得罪/v, 到/v, 电视台/nis, 。/w]
傅 nz
达仁 ns
今 tg
将 d
运行 vn
安乐死 v
, w
却 d
突然 ad
爆出 v
自己 rr
20 m
年前 t
遭 v
纬 ng
来 vf
体育台 nz
封杀 v
, w
他 rr
不懂 v
自己 rr
哪里 rys
得罪 v
到 v
电视台 nis
。 w
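Once you have the word and its tag, a common next step is to keep only certain parts of speech. The following is a minimal sketch, not something pyhanlp provides: Term is a hypothetical stand-in for the objects returned by HanLP.segment() so that the snippet runs without a JVM, and filter_by_nature is a helper name I made up. It assumes, as in the output above, that noun-like tags start with n and verb-like tags start with v.

```python
from collections import namedtuple

# Hypothetical stand-in for the terms returned by HanLP.segment();
# the real objects also expose .word and .nature.
Term = namedtuple("Term", ["word", "nature"])


def filter_by_nature(terms, prefixes):
    """Return the words whose POS tag starts with one of the given prefixes."""
    prefixes = tuple(prefixes)
    # str() is used because the real .nature is a Java object, not a Python str.
    return [str(t.word) for t in terms if str(t.nature).startswith(prefixes)]


# A few (word, nature) pairs copied from the output above.
terms = [
    Term("傅", "nz"),
    Term("达仁", "ns"),
    Term("突然", "ad"),
    Term("爆出", "v"),
    Term("体育台", "nz"),
    Term("封杀", "v"),
    Term("电视台", "nis"),
]

nouns = filter_by_nature(terms, ["n"])
verbs = filter_by_nature(terms, ["v"])
print(nouns)
print(verbs)
```

With real data, passing HanLP.segment(text) as terms should work the same way, since the loop above reads the same two attributes.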
Keyword extraction
You can pass a short text to HanLP.extractKeyword() to get some keywords. The second argument is the number of keywords you want.
print(HanLP.extractKeyword(text, 5))
Output:
[体育台, 突然, 爆出, 纬, 年前]
Text summarization
This is a very interesting feature. In my experience, it works better on longer texts.
print(HanLP.extractSummary(text, 6))
Output:
[他不懂自己哪里得罪到电视台]
The problem is that no matter what number I set, it only returns this one sentence. Perhaps there are no other sentences that can be summarized?
Dependency Parsing
Dependency parsing works similarly to the one in Stanford CoreNLP.
print(HanLP.parseDependency(text))
Output:
1 傅达仁 傅达仁 nh nr _ 4 主谓关系 _ _
2 今 今 Tg Tg _ 4 状中结构 _ _
3 将 将 d d _ 4 状中结构 _ _
4 运行 运行 v v _ 0 核心关系 _ _
5 安乐死 安乐死 a a _ 4 动宾关系 _ _
6 , , wp w _ 4 标点符号 _ _
7 却 却 d d _ 9 状中结构 _ _
8 突然 突然 a ad _ 9 状中结构 _ _
9 爆出 爆出 v v _ 4 并列关系 _ _
10 自己 自己 r r _ 11 定中关系 _ _
11 20年前 20年前 nt t _ 12 状中结构 _ _
12 遭 遭 v v _ 9 动宾关系 _ _
13 纬来 纬来 v v _ 14 定中关系 _ _
14 体育台 体育台 n n _ 15 主谓关系 _ _
15 封杀 封杀 v v _ 12 动宾关系 _ _
16 , , wp w _ 9 标点符号 _ _
17 他 他 r r _ 19 主谓关系 _ _
18 不 不 d d _ 19 状中结构 _ _
19 懂 懂 v v _ 9 并列关系 _ _
20 自己 自己 r r _ 22 主谓关系 _ _
21 哪里 哪里 r r _ 22 状中结构 _ _
22 得罪 得罪 v v _ 19 动宾关系 _ _
23 到 到 v v _ 22 动补结构 _ _
24 电视台 电视台 n n _ 22 动宾关系 _ _
25 。 。 wp w _ 4 标点符号 _ _
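The output above uses a CoNLL-style column layout. If you want to work with it in Python, one option is to parse the printed text. This is a rough sketch under my own assumptions: the columns are separated by whitespace, and only columns 1, 2, 7 and 8 (ID, word, head ID, relation) are used. parse_conll and Arc are names I made up; they are not part of pyhanlp.

```python
from collections import namedtuple

# Hypothetical container for one dependency arc.
Arc = namedtuple("Arc", ["id", "word", "head", "relation"])


def parse_conll(text):
    """Parse CoNLL-style lines into Arc tuples (assumes whitespace-separated columns)."""
    arcs = []
    for line in text.strip().splitlines():
        cols = line.split()
        if len(cols) < 8:
            continue  # skip blank or malformed lines
        arcs.append(Arc(int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return arcs


# The first five lines of the output above.
sample = """
1 傅达仁 傅达仁 nh nr _ 4 主谓关系 _ _
2 今 今 Tg Tg _ 4 状中结构 _ _
3 将 将 d d _ 4 状中结构 _ _
4 运行 运行 v v _ 0 核心关系 _ _
5 安乐死 安乐死 a a _ 4 动宾关系 _ _
"""

arcs = parse_conll(sample)
words = {arc.id: arc.word for arc in arcs}
for arc in arcs:
    head = words.get(arc.head, "ROOT")  # head 0 means the root of the sentence
    print(f"{arc.word} --({arc.relation})--> {head}")
```

In a real script, you would probably pass str(HanLP.parseDependency(text)) to parse_conll instead of the hard-coded sample.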
Supplement
If you need a function that pyhanlp doesn't currently expose, you can use JClass('HANLP_CLASS_PATH') to load the Java class and call it directly. To find the class path for a given function, you may have to look through HanLP's GitHub repository.
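As an illustration, here is a minimal sketch of loading a Java class with JClass. It assumes the HanLP 1.x class path com.hankcs.hanlp.tokenizer.NLPTokenizer and its static segment() method; check HanLP's GitHub for the class you actually need. The helper names load_hanlp_class and segment_with_nlp_tokenizer are my own, and pyhanlp is imported inside the function because importing it starts the JVM.

```python
def load_hanlp_class(class_path):
    """Load a HanLP Java class by its full class path."""
    # Imported lazily: importing pyhanlp starts the JVM
    # (and downloads the data on first use).
    from pyhanlp import JClass
    return JClass(class_path)


def segment_with_nlp_tokenizer(text):
    """Segment text with NLPTokenizer and return (word, nature) pairs as Python strings."""
    # Assumed class path (HanLP 1.x); replace it with the class you need.
    NLPTokenizer = load_hanlp_class("com.hankcs.hanlp.tokenizer.NLPTokenizer")
    return [(str(term.word), str(term.nature)) for term in NLPTokenizer.segment(text)]
```

Calling segment_with_nlp_tokenizer(text) should then give you a plain Python list of pairs, which is easier to work with than the Java objects.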