[Python] Comparing Sequence Differences with the difflib Module

Last Updated on 2024-08-28 by Clay

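difflib is a module in the Python standard library for comparing differences between sequences (usually text). Years ago, while working on my master's thesis, I implemented this kind of comparison myself. Only now, after I have started working, did I discover that such a concise library exists, and I honestly don't know whether to laugh or cry.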

Usage

Comparing Similarity

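The core of the difflib module is the SequenceMatcher class, which compares two sequences directly. Its first parameter, passed as None in the example below, is isjunk: a function that lets us define which elements or characters to ignore. These ignored elements are treated as "junk". The ratio() method returns a float between 0 and 1, where 1 means the two sequences are identical.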

import difflib

# Strings
str1 = "apple pie"
str2 = "apple pies"

# SequenceMatcher (isjunk=None, so no element is ignored)
matcher = difflib.SequenceMatcher(None, str1, str2)

# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")

Output:

similarity: 0.95
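
To make the isjunk parameter more concrete, here is a minimal sketch adapted from the example in the official difflib documentation. The two input strings are illustrative and not part of my own use case. The lambda marks every space character as junk; roughly speaking, the matcher will not start a match on a space, although spaces can still be absorbed into a match that sits next to them.

```python
import difflib

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

# isjunk is a one-argument function that returns True for elements to ignore.
# Here every space character is treated as junk.
matcher = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)

ratio = matcher.ratio()
print(f"similarity: {ratio:.3f}")  # similarity: 0.866

# Each block means a[i:i+size] == b[j:j+size].
# The final block is always a size-0 sentinel.
for block in matcher.get_matching_blocks():
    print(block)
```

For short inputs like these the junk function often does not change the ratio much; it matters more when frequent filler elements (whitespace, blank lines) would otherwise dominate the alignment.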

Beyond strings, it can also compare lists, in which case each list item is treated as one element:

import difflib

# Arrays
arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]

# SequenceMatcher (compares whole list items, not characters)
matcher = difflib.SequenceMatcher(None, arr1, arr2)

# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")

Output:

similarity: 0.50
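
The 0.50 above can be reproduced by hand. According to the difflib documentation, ratio() is computed as 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. The sketch below recomputes it from get_matching_blocks(); the variable names matched, total, and manual_ratio are just illustrative.

```python
import difflib

arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]

matcher = difflib.SequenceMatcher(None, arr1, arr2)

# M: number of matched elements, summed over all matching blocks.
# Only "abc" appears in both lists, so M is 1.
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: total number of elements in both sequences.
total = len(arr1) + len(arr2)

manual_ratio = 2.0 * matched / total
print(matched, total, manual_ratio)  # 1 4 0.5
```

Note that "abc" and "bca" count as completely different elements here: list items are compared for equality as a whole, so character-level similarity inside the items is not considered.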

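Finding Differences

difflib also provides ndiff and unified_diff, which show exactly where two texts differ. Both take lists of lines as input, which is why the examples call splitlines() first.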

import difflib

text1 = """Hello world
This is an example
Goodbye world"""

text2 = """Hello world
This is an example program
Goodbye world"""

diff = difflib.ndiff(text1.splitlines(), text2.splitlines())
print("\n".join(diff))

Output:

  Hello world
- This is an example
+ This is an example program
?                   ++++++++

  Goodbye world
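
The "?" guide line in the ndiff output marks which characters changed inside a line. If you need that information programmatically rather than as printed text, one option is SequenceMatcher.get_opcodes(). This is a minimal sketch on the single changed line; the names old and new are illustrative.

```python
import difflib

old = "This is an example"
new = "This is an example program"

matcher = difflib.SequenceMatcher(None, old, new)

# Each opcode is (tag, i1, i2, j1, j2): old[i1:i2] corresponds to new[j1:j2].
# tag is one of 'equal', 'replace', 'delete', 'insert'.
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:7} old[{i1}:{i2}]={old[i1:i2]!r} new[{j1}:{j2}]={new[j1:j2]!r}")
```

Here the opcodes report that the first 18 characters are equal and that the 8 characters " program" were inserted, which lines up with the eight "+" marks in the ndiff output.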

unified_diff produces a different display format, the same one used by tools such as diff -u and git diff:

import difflib

text1 = """Hello world
This is an example
Goodbye world"""

text2 = """Hello world
This is an example program
Goodbye world"""

diff = difflib.unified_diff(
    text1.splitlines(),
    text2.splitlines(), 
    fromfile="text1.txt",
    tofile="text2.txt",
)
print("\n".join(diff))

Output:

--- text1.txt

+++ text2.txt

@@ -1,3 +1,3 @@

 Hello world
-This is an example
+This is an example program
 Goodbye world
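
The blank lines after the ---, +++, and @@ lines in the output above appear because unified_diff terminates its control lines with lineterm, which defaults to a newline, while the lines produced by splitlines() carry no newline. A small sketch of the same comparison with lineterm="" (a documented parameter of unified_diff) gives a uniform result; diff_lines is just an illustrative name.

```python
import difflib

text1 = "Hello world\nThis is an example\nGoodbye world"
text2 = "Hello world\nThis is an example program\nGoodbye world"

diff_lines = list(difflib.unified_diff(
    text1.splitlines(),
    text2.splitlines(),
    fromfile="text1.txt",
    tofile="text2.txt",
    lineterm="",  # inputs have no trailing newlines, so control lines should not either
))
print("\n".join(diff_lines))
```

With this change the header and hunk lines are no longer followed by empty lines when joined with "\n".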

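Finding the Closest Matching String

What follows is the use case that actually led me to study this module and improve my code: finding the closest candidate string. When I use my LLM to extract scores from reports, it sometimes hallucinates and produces a number that is extremely close to the real score in the report but not identical. In that situation I can use get_close_matches() to find the closest correct answer. The n argument caps how many matches are returned, cutoff is the minimum similarity (between 0 and 1) a candidate must reach, and the results are sorted from most to least similar.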

import difflib

word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]

# Find the close matches
matches = difflib.get_close_matches(word, word_list, n=3, cutoff=0.6)
print(matches)

Output:

['8.50', '0.550', '0.50']
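
For the hallucination case described above, usually only the single best candidate matters. Below is a minimal sketch of a wrapper around get_close_matches(); the helper name snap_to_candidate and the variable valid_scores are hypothetical and only serve the illustration.

```python
import difflib
from typing import Optional, Sequence


def snap_to_candidate(
    value: str, candidates: Sequence[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the candidate most similar to value, or None if none reaches cutoff."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical list of scores that really appear in the report.
valid_scores = ["8.50", "9.32", "0.50", "0.550"]

print(snap_to_candidate("8.550", valid_scores))  # 8.50
print(snap_to_candidate("abc", valid_scores))    # None
```

Keep in mind that this measures string similarity, not numeric closeness: two numbers that are close in value can look quite different as strings, so a numeric comparison may be a better fit when the candidates are always parseable numbers.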
