
[NLP][Python] NLP tool for Chinese text: HanLP

HanLP (Han Language Processing) is an open-source project on GitHub that provides many functions:

  • Segmentation
  • Part-of-speech tagging
  • Named entity recognition
  • Keyword extraction
  • Text summarization
  • Traditional-to-Simplified Chinese conversion
  • Text recommendation
  • Text classification
  • Word2Vec

If you want to read more of its documentation, see: https://github.com/hankcs/HanLP

Or, if you want to try a demo: http://hanlp.com/


Preparation

pip3 install opencc-python-reimplemented
pip3 install pyhanlp

OpenCC is the tool I use to convert Traditional Chinese to Simplified Chinese, because HanLP works better on Simplified Chinese than on Traditional Chinese.

In fact, pyhanlp is a Python API that calls HanLP, which is written in pure Java. It needs to download a model the first time you use it.


Segmentation & POS

These two functions are used together in pyhanlp: a single call returns each word along with its POS tag. Let’s take a look at a short code sample:

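# -*- coding: utf-8 -*-
from opencc import OpenCC
from pyhanlp import HanLP

# 'tw2sp' converts Traditional Chinese (Taiwan standard) to Simplified Chinese,
# including phrase-level conversion
cc = OpenCC('tw2sp')
text = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'
text = cc.convert(text)

# Segment the text; each term carries its word and its POS tag (nature)
terms = HanLP.segment(text)
print(terms)
for term in terms:
    print(term.word, term.nature)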


Output:

[傅/nz, 达仁/ns, 今/tg, 将/d, 运行/vn, 安乐死/v, ,/w, 却/d, 突然/ad, 爆出/v, 自己/rr, 20/m, 年前/t, 遭/v, 纬/ng, 来/vf, 体育台/nz, 封杀/v, ,/w, 他/rr, 不懂/v, 自己/rr, 哪里/rys, 得罪/v, 到/v, 电视台/nis, 。/w]

傅 nz
达仁 ns
今 tg
将 d
运行 vn
安乐死 v
, w
却 d
突然 ad
爆出 v
自己 rr
20 m
年前 t
遭 v
纬 ng
来 vf
体育台 nz
封杀 v
, w
他 rr
不懂 v
自己 rr
哪里 rys
得罪 v
到 v
电视台 nis
。 w
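
Once you have the (word, tag) pairs, a common next step is to keep only certain kinds of words. Here is a minimal sketch of that idea. To stay runnable without pyhanlp, it hard-codes the pairs copied from the output above. The helper name filter_by_tag_prefix and the choice of the prefixes 'n' and 'v' are my own assumptions for illustration, not part of HanLP.

```python
# (word, nature) pairs copied from the segmentation output above
terms = [
    ("傅", "nz"), ("达仁", "ns"), ("今", "tg"), ("将", "d"), ("运行", "vn"),
    ("安乐死", "v"), (",", "w"), ("却", "d"), ("突然", "ad"), ("爆出", "v"),
    ("自己", "rr"), ("20", "m"), ("年前", "t"), ("遭", "v"), ("纬", "ng"),
    ("来", "vf"), ("体育台", "nz"), ("封杀", "v"), (",", "w"), ("他", "rr"),
    ("不懂", "v"), ("自己", "rr"), ("哪里", "rys"), ("得罪", "v"), ("到", "v"),
    ("电视台", "nis"), ("。", "w"),
]


def filter_by_tag_prefix(pairs, prefixes):
    """Keep the words whose POS tag starts with one of the given prefixes."""
    prefixes = tuple(prefixes)
    return [word for word, nature in pairs if nature.startswith(prefixes)]


# Keep noun-like and verb-like words; this drops punctuation, adverbs, pronouns, etc.
content_words = filter_by_tag_prefix(terms, ["n", "v"])
print(content_words)
```

With real pyhanlp output, the pairs could presumably be built from the loop above with something like (term.word, str(term.nature)).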

Keyword extraction

You can pass in a short text and use HanLP.extractKeyword() to get some keywords. The second argument is the number of keywords to return.

print(HanLP.extractKeyword(text, 5))


Output:

[体育台, 突然, 爆出, 纬, 年前]

Text summarization

This is a very interesting feature. In my experience, it works better on longer texts.

print(HanLP.extractSummary(text, 6))


Output:

[他不懂自己哪里得罪到电视台]

The problem is that no matter what number I set, it only returns this one sentence. Perhaps there are no more sentences that can be extracted for the summary?


Dependency Parsing

Dependency parsing here works much like it does in Stanford CoreNLP.

print(HanLP.parseDependency(text))


Output:

1    傅达仁 傅达仁 nh  nr  _   4   主谓关系    _   _
2    今   今   Tg  Tg  _   4   状中结构    _   _
3    将   将   d   d   _   4   状中结构    _   _
4    运行  运行  v   v   _   0   核心关系    _   _
5    安乐死 安乐死 a   a   _   4   动宾关系    _   _
6    ,   ,   wp  w   _   4   标点符号    _   _
7    却   却   d   d   _   9   状中结构    _   _
8    突然  突然  a   ad  _   9   状中结构    _   _
9    爆出  爆出  v   v   _   4   并列关系    _   _
10    自己  自己  r   r   _   11  定中关系    _   _
11    20年前    20年前    nt  t   _   12  状中结构    _   _
12    遭   遭   v   v   _   9   动宾关系    _   _
13    纬来  纬来  v   v   _   14  定中关系    _   _
14    体育台 体育台 n   n   _   15  主谓关系    _   _
15    封杀  封杀  v   v   _   12  动宾关系    _   _
16    ,   ,   wp  w   _   9   标点符号    _   _
17    他   他   r   r   _   19  主谓关系    _   _
18    不   不   d   d   _   19  状中结构    _   _
19    懂   懂   v   v   _   9   并列关系    _   _
20    自己  自己  r   r   _   22  主谓关系    _   _
21    哪里  哪里  r   r   _   22  状中结构    _   _
22    得罪  得罪  v   v   _   19  动宾关系    _   _
23    到   到   v   v   _   22  动补结构    _   _
24    电视台 电视台 n   n   _   22  动宾关系    _   _
25    。   。   wp  w   _   4   标点符号    _   _
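
The output above follows the CoNLL layout: the first column is the word ID, the second is the word, the seventh is the ID of its head word (0 means the root), and the eighth is the dependency relation. Here is a rough sketch of how that text could be turned into a small structure for further processing. It hard-codes the first five rows so it runs without pyhanlp. The helper name parse_conll and the dict keys are my own assumptions.

```python
from collections import defaultdict

# The first five rows copied from the output above (whitespace-separated columns)
conll_text = """\
1    傅达仁 傅达仁 nh  nr  _   4   主谓关系    _   _
2    今   今   Tg  Tg  _   4   状中结构    _   _
3    将   将   d   d   _   4   状中结构    _   _
4    运行  运行  v   v   _   0   核心关系    _   _
5    安乐死 安乐死 a   a   _   4   动宾关系    _   _
"""


def parse_conll(text):
    """Turn CoNLL rows into dicts holding id, word, head id, and relation."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 8:
            continue  # skip blank or malformed lines
        rows.append({
            "id": int(cols[0]),
            "word": cols[1],
            "head": int(cols[6]),
            "relation": cols[7],
        })
    return rows


rows = parse_conll(conll_text)

# The root is the row whose head id is 0
root = next(r for r in rows if r["head"] == 0)

# Group the words by the id of the head they depend on
children = defaultdict(list)
for r in rows:
    children[r["head"]].append(r["word"])

print(root["word"], children[root["id"]])
```

With pyhanlp, the input text could presumably be obtained with str(HanLP.parseDependency(text)); I have only checked the sketch against the pasted rows.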

Supplement

If you need a function that pyhanlp doesn’t currently expose, you can use JClass('HANLP_CLASS_PATH') to load the Java class and call it directly. To find the class path for a given function, you may have to look through HanLP’s GitHub repository.
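
As a rough sketch of the JClass approach, the snippet below loads HanLP's NLPTokenizer by its Java class path and calls its static segment() method. The wrapper name segment_with_nlp_tokenizer is my own, and I am assuming a HanLP 1.x installation where this class path exists; other classes may need different calls.

```python
def segment_with_nlp_tokenizer(text):
    """Segment text with HanLP's NLPTokenizer, loaded via its Java class path."""
    # Imported inside the function so that pyhanlp (and the Java side behind it)
    # is only loaded when the function is actually called
    from pyhanlp import JClass

    # Assumed class path from HanLP 1.x; check HanLP's GitHub for other classes
    NLPTokenizer = JClass('com.hankcs.hanlp.tokenizer.NLPTokenizer')
    return NLPTokenizer.segment(text)
```

Calling segment_with_nlp_tokenizer(text) should then return a list of terms similar to what HanLP.segment(text) returns.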

