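[Python] How to Convert Strings Between Full-Width and Half-Width Characters

Introduction

Using Python for natural language processing (NLP) tasks is very common in machine learning and deep learning, and the tools and libraries needed for text preprocessing are readily available.

What I want to record today is how to convert strings between full-width characters and half-width characters in Python.

Raw text often contains a mix of full-width and half-width characters. To a human, ９ and 9 look the same, but to the computer they are two different characters: they have different code points and are stored as different bytes.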

# coding: utf-8


def main():
    data_1 = "９"
    data_2 = "9"

    print(data_1 == data_2)


if __name__ == "__main__":
    main()

Output:

False
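To see why the comparison fails, it can help to look at the code points and the UTF-8 bytes of the two characters. The snippet below is a small illustrative sketch; the variable names `full` and `half` are my own and not part of the original example.

```python
full = "９"  # FULLWIDTH DIGIT NINE, U+FF19
half = "9"   # DIGIT NINE, U+0039

for ch in (full, half):
    # ord() gives the code point, encode() gives the stored bytes
    print(ch, ord(ch), hex(ord(ch)), ch.encode("utf-8"))

# The two code points are a fixed distance apart
print(ord(full) - ord(half))  # 65248
```

The fixed distance of 65248 is what the first conversion method relies on.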

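Full-Width / Half-Width Conversion

Method 1: Unicode code point conversion

In Python, ord() converts a character to its Unicode code point, and chr() converts a code point back to a character.

Full-width characters have code points 65281 to 65374 (U+FF01 to U+FF5E)
Half-width characters (printable ASCII) have code points 33 to 126 (U+0021 to U+007E)

To convert, add 65248 to a half-width character's code point to get its full-width form. Conversely, subtract 65248 to turn a full-width character into its half-width form. The space is the one exception: the full-width space is U+3000 (12288) and the half-width space is U+0020 (32), so the fixed offset does not apply to it and it has to be mapped separately.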

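# coding: utf-8

OFFSET = 65248  # distance between a full-width character and its half-width form


def full2half(c: str) -> str:
    # Shift only characters in the full-width range; leave everything else alone
    return chr(ord(c) - OFFSET) if 65281 <= ord(c) <= 65374 else c


def half2full(c: str) -> str:
    # Shift only characters in the half-width range; leave everything else alone
    return chr(ord(c) + OFFSET) if 33 <= ord(c) <= 126 else c


def main() -> None:
    c = "Ａ"

    print(f"original fullwidth: {c}")
    print(f"to halfwidth = {full2half(c)}")
    print(f"back to fullwidth: {half2full(full2half(c))}")


if __name__ == "__main__":
    main()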

Output:

original fullwidth: Ａ
to halfwidth = A
back to fullwidth: Ａ

This is the more direct of the two approaches. Note that these functions work on a single character at a time.
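Real text usually arrives as whole strings rather than single characters. The following is a minimal sketch of a string-level version of the same offset idea, assuming lookup tables built once and applied with str.translate(). The names `FULL_TO_HALF`, `HALF_TO_FULL`, `full2half_str`, and `half2full_str` are hypothetical and introduced only for illustration. The sketch also maps the full-width space (U+3000), which lies outside the 65281 to 65374 range.

```python
OFFSET = 0xFEE0                         # 65248
FULL_START, FULL_END = 0xFF01, 0xFF5E   # 65281 to 65374
FULL_SPACE, HALF_SPACE = 0x3000, 0x20   # the space does not follow the offset

# Code point -> code point tables, built once and reused
FULL_TO_HALF = {cp: cp - OFFSET for cp in range(FULL_START, FULL_END + 1)}
FULL_TO_HALF[FULL_SPACE] = HALF_SPACE
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}


def full2half_str(text: str) -> str:
    # Characters missing from the table (e.g. Chinese text) pass through unchanged
    return text.translate(FULL_TO_HALF)


def half2full_str(text: str) -> str:
    return text.translate(HALF_TO_FULL)


print(full2half_str("Ｐｙｔｈｏｎ\u3000３．９ 中文"))  # Python 3.9 中文
print(half2full_str("A1!"))                            # Ａ１！
```

Because characters outside the tables are left untouched, mixed Chinese and Latin text can be passed in as-is.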

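Method 2: Using the unicodedata module

Another classic approach is to use Python's built-in unicodedata module. However, unicodedata.normalize() with the "NFKC" form only works in one direction here: it converts full-width to half-width, not the other way around.

On the other hand, a single library call does the whole job, which is clearly more convenient, and normalize() accepts an entire string rather than one character. It should also perform better than shifting code points by hand, though that is an expectation rather than a measurement. Keep in mind that NFKC normalizes other compatibility characters as well, not only full-width forms.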

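# coding: utf-8
import unicodedata


def full2half(c: str) -> str:
    # NFKC replaces compatibility characters (including full-width forms)
    # with their standard equivalents
    return unicodedata.normalize("NFKC", c)


def half2full(c: str) -> str:
    # unicodedata has no reverse operation, so fall back to the offset method
    return chr(ord(c)+65248)


def main() -> None:
    c = "Ａ"

    print(f"original fullwidth: {c}")
    print(f"to halfwidth = {full2half(c)}")
    print(f"back to fullwidth: {half2full(full2half(c))}")


if __name__ == "__main__":
    main()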

Output:

original fullwidth: Ａ
to halfwidth = A
back to fullwidth: Ａ
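One thing worth checking before adopting this method is that NFKC is a general compatibility normalization, so it changes more than full-width letters and digits. The sketch below runs a few sample characters through it; the helper name `nfkc` and the choice of samples are my own, for illustration only.

```python
import unicodedata


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


samples = [
    "Ａ",       # full-width letter   -> A
    "１",       # full-width digit    -> 1
    "\u3000",  # full-width space    -> regular space
    "①",       # circled digit       -> 1
    "ｶ",       # half-width katakana -> カ (becomes full-width)
    "㎏",       # square kg symbol    -> kg
]

for s in samples:
    print(repr(s), "->", repr(nfkc(s)))
```

If only the 65281 to 65374 range should be touched, the offset method gives tighter control. If a broad cleanup of compatibility characters is acceptable, NFKC is the simpler choice.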
