Last Updated on 2024-08-28 by Clay
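difflib is a module in the Python standard library for comparing sequences, most often text. Years ago, while writing my master's thesis, I implemented this kind of comparison by hand; only now, at work, did I discover that such a concise library already exists, and I don't know whether to laugh or cry.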
Usage
Comparing Similarity
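The core of the difflib module is the SequenceMatcher class, which compares two sequences directly. Its first argument, passed as None below, is isjunk: a function that lets us choose which elements or characters to ignore. The ignored elements are treated as "junk".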
import difflib
# Strings
str1 = "apple pie"
str2 = "apple pies"
# SequenceMatcher
matcher = difflib.SequenceMatcher(None, str1, str2)
# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")
Output:
similarity: 0.95
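To make the isjunk argument more concrete, here is a minimal sketch based on the example in the official difflib documentation. It compares the same two strings with and without a junk filter and prints the longest matching block; the variable names are my own choice for illustration.

```python
import difflib

a = " abcd"
b = "abcd abcd"

# No junk filter: the leading space may take part in the match.
plain = difflib.SequenceMatcher(None, a, b)
plain_match = plain.find_longest_match(0, len(a), 0, len(b))
print(plain_match)  # Match(a=0, b=4, size=5)

# Treat spaces as junk: a space in b can no longer seed a match,
# so the longest block becomes "abcd" at the very start of b.
no_space = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)
junk_match = no_space.find_longest_match(0, len(a), 0, len(b))
print(junk_match)  # Match(a=1, b=0, size=4)
```

Without the filter, the best block is " abcd" including the space; with it, the matcher falls back to the four letters alone.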
Beyond strings, it can also compare lists:
import difflib
# Arrays
arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]
# SequenceMatcher
matcher = difflib.SequenceMatcher(None, arr1, arr2)
# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")
Output:
similarity: 0.50
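Because any sequence of hashable items works, one common use is comparing sentences word by word instead of character by character. The sketch below is an illustration with made-up sentences; it also calls get_opcodes(), which reports how to turn the first sequence into the second.

```python
import difflib

words1 = "the quick brown fox".split()
words2 = "the quick red fox".split()

matcher = difflib.SequenceMatcher(None, words1, words2)

# ratio() is 2 * matches / total elements = 2 * 3 / 8
print(f"similarity: {matcher.ratio():.2f}")  # similarity: 0.75

# Each opcode is (tag, i1, i2, j1, j2): slices of words1 and words2.
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, words1[i1:i2], words2[j1:j2])
# equal ['the', 'quick'] ['the', 'quick']
# replace ['brown'] ['red']
# equal ['fox'] ['fox']
```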
Finding Differences
difflib also provides ndiff and unified_diff to show exactly where two texts differ.
import difflib
text1 = """Hello world
This is an example
Goodbye world"""
text2 = """Hello world
This is an example program
Goodbye world"""
diff = difflib.ndiff(text1.splitlines(), text2.splitlines())
print("\n".join(diff))
Output:
  Hello world
- This is an example
+ This is an example program
?                   ++++++++
  Goodbye world
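Every line produced by ndiff starts with a two-character prefix: "- " for lines only in the first text, "+ " for lines only in the second, "  " for lines common to both, and "? " for hint lines that mark where the change occurred within a line. As a minimal sketch (the variable names are illustrative), the prefix can be used to keep only the lines that changed:

```python
import difflib

old_lines = ["Hello world", "This is an example", "Goodbye world"]
new_lines = ["Hello world", "This is an example program", "Goodbye world"]

# Keep removed ("- ") and added ("+ ") lines; drop common and "? " hint lines.
changed = [
    line
    for line in difflib.ndiff(old_lines, new_lines)
    if line.startswith(("- ", "+ "))
]
print(changed)
# ['- This is an example', '+ This is an example program']
```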
unified_diff, by contrast, presents the differences in another format:
import difflib
text1 = """Hello world
This is an example
Goodbye world"""
text2 = """Hello world
This is an example program
Goodbye world"""
diff = difflib.unified_diff(
    text1.splitlines(),
    text2.splitlines(),
    fromfile="text1.txt",
    tofile="text2.txt",
    lineterm="",  # input lines carry no "\n", so don't append one to the header lines
)
print("\n".join(diff))
Output:
--- text1.txt
+++ text2.txt
@@ -1,3 +1,3 @@
 Hello world
-This is an example
+This is an example program
 Goodbye world
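One detail worth knowing: unified_diff appends lineterm (default "\n") to the header and hunk lines it generates, and it expects the input lines to carry their own line endings. A minimal sketch of that convention, assuming texts that end with a newline, keeps the endings with splitlines(keepends=True) and writes the result out without adding separators:

```python
import difflib
import sys

text1 = "Hello world\nThis is an example\nGoodbye world\n"
text2 = "Hello world\nThis is an example program\nGoodbye world\n"

diff_lines = list(
    difflib.unified_diff(
        text1.splitlines(keepends=True),  # every line keeps its "\n"
        text2.splitlines(keepends=True),
        fromfile="text1.txt",
        tofile="text2.txt",
    )
)

# Each element already ends with "\n", so write them out as-is.
sys.stdout.writelines(diff_lines)
```

This is the style to reach for when the lines come from reading a file, since lines read that way already end with a newline.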
Finding the Closest Matching Strings
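This last feature is the real reason I looked into difflib this time: finding the closest candidate string, which I used to improve my own code. When I use my LLM to extract scores from reports, it sometimes hallucinates and produces a number that is extremely close to the real score but not identical to it. In that case, I can use get_close_matches() to find the closest correct answer.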
import difflib
word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]
# Find the close matches
matches = difflib.get_close_matches(word, word_list, n=3, cutoff=0.6)
print(matches)
Output:
['8.50', '0.550', '0.50']
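To show how this could fit into a correction step, here is a minimal sketch. The helper snap_to_candidate and the list report_scores are hypothetical names of mine, not part of difflib. It returns the single best match, or None when nothing reaches the cutoff. Keep in mind that the comparison is based on characters, not on numeric distance, so it suits near-miss strings rather than general rounding.

```python
import difflib


def snap_to_candidate(extracted, candidates, cutoff=0.6):
    """Return the candidate most similar to `extracted`, or None if none reaches `cutoff`."""
    matches = difflib.get_close_matches(extracted, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical scores that actually appear in the report.
report_scores = ["8.50", "9.32", "0.50", "0.550"]

best = snap_to_candidate("8.550", report_scores)
missing = snap_to_candidate("abc", report_scores)

print(best)     # 8.50
print(missing)  # None
```

Returning None instead of a weak match leaves the decision to the caller, for example to flag the value for manual review.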