
[Python] How to Convert Full-width Characters in a String to Half-width Characters

Introduction

Python is widely used for Natural Language Processing (NLP) in machine learning and deep learning, and plenty of tools and libraries are available for text preprocessing.

Today I want to record how to convert full-width characters into half-width characters (and half-width into full-width) with Python.

Raw text often mixes full-width and half-width characters. Even though 9 and 9 look almost the same to humans, they are different characters to a computer.

# coding: utf-8


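def main():
    data_1 = "9"  # full-width digit nine (U+FF19)
    data_2 = "9"  # half-width (ASCII) digit nine (U+0039)

    # Different code points, so the two strings are not equal
    print(data_1 == data_2)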


if __name__ == "__main__":
    main()

Output:

False

Conversion

Method 1: Convert Unicode

In Python, ord() returns the Unicode code point of a character, and chr() returns the character for a given code point.

  • Full-width characters have code points from 65281 to 65374 (U+FF01 to U+FF5E)
  • Half-width characters have code points from 33 to 126 (U+0021 to U+007E)

To convert half-width to full-width, add 65248 to the code point; conversely, to convert full-width to half-width, subtract 65248.
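
The ranges and the offset above can be checked with a few lines of arithmetic. This is only a small sanity check, not part of the conversion itself; the constant names are made up for this sketch, and the full-width characters are written as escapes so they cannot be confused with their ASCII look-alikes.

```python
# Constant names are illustrative; the numbers are the ranges listed above.
HALF_START, HALF_END = 33, 126        # "!" .. "~"
FULL_START, FULL_END = 65281, 65374   # full-width "!" .. full-width "~"
OFFSET = FULL_START - HALF_START      # 65248 (0xFEE0)

# Both ranges have the same length, so one constant offset maps one onto the other.
same_length = (HALF_END - HALF_START) == (FULL_END - FULL_START)

full_a = "\uff21"  # full-width Latin capital letter A
print(OFFSET, hex(OFFSET), same_length)
print(ord(full_a), ord("A"), ord(full_a) - ord("A"))
print(chr(HALF_START), chr(FULL_START), chr(HALF_END), chr(FULL_END))
```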

# coding: utf-8


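def full2half(c: str) -> str:
    # Full-width forms occupy code points 65281..65374
    if 65281 <= ord(c) <= 65374:
        return chr(ord(c) - 65248)
    return c  # leave every other character unchanged


def half2full(c: str) -> str:
    # Half-width (ASCII) characters occupy code points 33..126
    if 33 <= ord(c) <= 126:
        return chr(ord(c) + 65248)
    return c  # leave every other character unchanged


def main() -> None:
    c = "A"  # full-width Latin capital letter A (U+FF21)

    print(f"original fullwidth: {c}")
    print(f"to halfwidth = {full2half(c)}")
    print(f"back to fullwidth: {half2full(full2half(c))}")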


if __name__ == "__main__":
    main()

Output:

original fullwidth: A
to halfwidth = A
back to fullwidth: A

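This is a relatively straightforward conversion method. Note that the offset only applies to the ranges listed above; the full-width space (U+3000) is a separate case that corresponds to the ordinary space (U+0020).
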
Method 2: Use unicodedata Module

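Another way is to use the unicodedata module from Python's standard library. Its NFKC normalization only goes in one direction here: it converts full-width characters into half-width ones, not the reverse.

Since it is a single call to a standard-library function, it is the easier approach. It is also likely to perform better than doing the code point arithmetic ourselves in Python, although that is worth measuring on your own data.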
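
One thing to keep in mind is that NFKC is a general compatibility normalization, not a width-only conversion, so it rewrites other characters too. The sketch below shows this on a handful of sample characters picked for illustration (they do not come from the original example); `code_points` is a hypothetical helper used only to make the result readable.

```python
import unicodedata


def code_points(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Sample characters, written as escapes to avoid look-alike confusion.
samples = {
    "full-width A": "\uff21",
    "full-width space": "\u3000",
    "half-width katakana A": "\uff71",
    "circled digit one": "\u2460",
    "square kg": "\u338f",
}

for name, ch in samples.items():
    normalized = unicodedata.normalize("NFKC", ch)
    print(f"{name}: {code_points(ch)} -> {code_points(normalized)}")
```

If only the width of ASCII-range characters should change, the offset approach from Method 1 may be the safer choice, since it leaves everything outside those ranges untouched.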

# coding: utf-8
import unicodedata


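def full2half(c: str) -> str:
    # NFKC normalization folds full-width forms into their half-width equivalents
    return unicodedata.normalize("NFKC", c)


def half2full(c: str) -> str:
    # unicodedata offers no inverse, so fall back to the offset for 33..126
    if 33 <= ord(c) <= 126:
        return chr(ord(c) + 65248)
    return c  # leave every other character unchanged


def main() -> None:
    c = "A"  # full-width Latin capital letter A (U+FF21)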

    print(f"original fullwidth: {c}")
    print(f"to halfwidth = {full2half(c)}")
    print(f"back to fullwidth: {half2full(full2half(c))}")


if __name__ == "__main__":
    main()

Output:

original fullwidth: A
to halfwidth = A
back to fullwidth: A
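
The examples above convert one character at a time, while real text arrives as whole strings. A minimal sketch of one way to handle that is to build a translation table from the Method 1 offset and pass it to str.translate. The names `FULL_TO_HALF`, `HALF_TO_FULL`, `full2half_text`, and `half2full_text` are hypothetical, and mapping the full-width space (U+3000) to the ordinary space is an extra assumption that is not part of the offset rule.

```python
OFFSET = 65248  # distance between a half-width character and its full-width form

# Map every full-width code point (65281..65374) to its half-width counterpart.
FULL_TO_HALF = {cp: cp - OFFSET for cp in range(65281, 65375)}
FULL_TO_HALF[0x3000] = 0x20  # assumption: also fold the full-width space

# Invert the table for the opposite direction.
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def full2half_text(text: str) -> str:
    # Characters missing from the table are left unchanged by str.translate
    return text.translate(FULL_TO_HALF)


def half2full_text(text: str) -> str:
    return text.translate(HALF_TO_FULL)


if __name__ == "__main__":
    # Full-width "Python 3", written as escapes to avoid look-alike confusion
    sample = "\uff30\uff59\uff54\uff48\uff4f\uff4e\u3000\uff13"
    print(full2half_text(sample))
    print(half2full_text("Python 3") == sample)
```

Building the tables once and reusing them avoids a Python-level loop over every character, and unlike NFKC this only touches the characters listed in the table.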
