Last Updated on 2022-08-06 by Clay
Introduction
Using Python for Natural Language Processing (NLP) is quite common in machine learning and deep learning, and plenty of tools and libraries for text preprocessing are available.
Today I want to record how to convert full-width characters into half-width characters (and half-width into full-width) with Python.
Raw text often mixes full-width and half-width characters. Even though 9 and 9 look the same to humans, they are different characters to a computer.
# coding: utf-8
def main():
    data_1 = "9"  # half-width digit, U+0039
    data_2 = "9"  # full-width digit, U+FF19
    print(data_1 == data_2)


if __name__ == "__main__":
    main()
Output:
False
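To see why the comparison fails, it can help to look at the code points directly. The following is only a small sketch: `describe` is a hypothetical helper name of mine, and the characters are written as escape sequences so they cannot be confused visually. `unicodedata.east_asian_width()` is the standard-library call that reports the width class of a character.

```python
import unicodedata


def describe(c: str) -> tuple:
    """Return (code point, East Asian Width class) for one character."""
    return ord(c), unicodedata.east_asian_width(c)


half = "9"        # DIGIT NINE, U+0039
full = "\uff19"   # FULLWIDTH DIGIT NINE, U+FF19

print(describe(half))  # (57, 'Na')
print(describe(full))  # (65305, 'F')
```

The width class `F` marks a full-width character and `Na` a narrow one, so this can also be used to tell the two apart before converting.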
Conversion
Method 1: Convert Unicode
In Python, we can use ord() to get the Unicode code point of a character, and chr() to get the character for a given code point.
- Full-width characters occupy code points 65281 to 65374 (U+FF01 to U+FF5E)
- The corresponding half-width characters occupy code points 33 to 126 (U+0021 to U+007E)
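The two ranges listed above can be checked directly with ord() and chr(). This is just a quick sanity check, not part of the conversion itself; the constant names are my own, and the full-width characters are written as escapes to avoid visual ambiguity.

```python
# Endpoints of the two ranges
FULL_START, FULL_END = ord("\uff01"), ord("\uff5e")  # full-width "!" and "~"
HALF_START, HALF_END = ord("!"), ord("~")

print(FULL_START, FULL_END)  # 65281 65374
print(HALF_START, HALF_END)  # 33 126

# chr() is the inverse of ord()
print(chr(ord("A")) == "A")  # True

# Both ranges have the same length, so each pair is separated by one fixed offset
OFFSET = FULL_START - HALF_START
print(OFFSET, hex(OFFSET))  # 65248 0xfee0
```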
To convert a half-width character to full-width, add 65248 to its code point; conversely, to convert a full-width character to half-width, subtract 65248.
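# coding: utf-8
def full2half(c: str) -> str:
    # Shift one full-width character (U+FF01..U+FF5E) down to its ASCII counterpart
    return chr(ord(c) - 65248)


def half2full(c: str) -> str:
    # Shift one ASCII character (U+0021..U+007E) up to its full-width counterpart
    return chr(ord(c) + 65248)


def main() -> None:
    c = "A"  # full-width letter, U+FF21
    print(f"original fullwidth: {c}")
    print(f"to halfwidth = {full2half(c)}")
    print(f"back to fullwidth: {half2full(full2half(c))}")


if __name__ == "__main__":
    main()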
Output:
original fullwidth: A
to halfwidth = A
back to fullwidth: A
This is a relatively straightforward conversion method.
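The functions above handle one character at a time and assume it is inside the expected range; anything else is shifted to an unrelated character or raises a ValueError. A possible way to convert whole strings safely is to build lookup tables for str.translate(), which leaves every unmapped character untouched. The names `full2half_text` and `half2full_text` are hypothetical, and mapping the full-width space (U+3000) to an ordinary space is my own assumption, since it sits outside the 65281 to 65374 range.

```python
OFFSET = 0xFEE0  # 65248

# Translation tables: code point -> code point
FULL_TO_HALF = {cp: cp - OFFSET for cp in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # assumption: full-width space -> ASCII space
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def full2half_text(text: str) -> str:
    """Convert every full-width character in text; leave the rest as-is."""
    return text.translate(FULL_TO_HALF)


def half2full_text(text: str) -> str:
    """Convert every half-width ASCII character in text; leave the rest as-is."""
    return text.translate(HALF_TO_FULL)


sample = "\uff21\uff22\uff23\u3000\uff11\uff12\uff13"  # full-width "ABC 123"
print(full2half_text(sample))               # ABC 123
print(half2full_text("ABC 123") == sample)  # True
```

Characters such as Chinese text or emoji pass through unchanged, which is usually what you want in preprocessing.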
Method 2: Use unicodedata Module
Another way is to use Python's built-in unicodedata module. Note that unicodedata.normalize() can only convert full-width into half-width, not the other way around.
Still, calling a library function is the easier approach, and it is likely to perform better than doing the code point arithmetic ourselves.
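# coding: utf-8
import unicodedata


def full2half(c: str) -> str:
    # NFKC normalization maps full-width forms to their regular counterparts
    return unicodedata.normalize("NFKC", c)


def half2full(c: str) -> str:
    # unicodedata has no reverse operation, so shift the code point manually
    return chr(ord(c) + 65248)


def main() -> None:
    c = "A"  # full-width letter, U+FF21
    print(f"original fullwidth: {c}")
    print(f"to halfwidth = {full2half(c)}")
    print(f"back to fullwidth: {half2full(full2half(c))}")


if __name__ == "__main__":
    main()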
Output:
original fullwidth: A
to halfwidth = A
back to fullwidth: A
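One thing worth keeping in mind is that NFKC is a general compatibility normalization, not a dedicated width converter, so it changes more than the full-width range discussed above. The sketch below shows a few cases I am aware of; `nfkc` is just a hypothetical wrapper name, and the sample characters are my own picks.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Thin wrapper around unicodedata.normalize with the NFKC form."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "\uff21": "full-width A",
    "\u3000": "full-width (ideographic) space",
    "\u2460": "circled digit one",
    "\u338f": "square kg",
    "\uff76": "half-width katakana ka",
}

for ch, label in samples.items():
    print(f"{label}: U+{ord(ch):04X} -> {nfkc(ch)!r}")
```

The circled digit becomes a plain "1", the square unit becomes "kg", and half-width katakana is actually widened to regular katakana. If only the ASCII-range full-width characters should change, the code point arithmetic from Method 1 gives tighter control.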
References
- https://stackoverflow.com/questions/10959227/how-to-distinguish-whether-a-word-is-half-width-or-full-width
- https://docs.python.org/3/library/unicodedata.html