
[Python] Comparing Sequence Differences with the difflib Module

Last Updated on 2024-08-28 by Clay

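difflib is a module in the Python standard library for comparing differences between sequences (usually text). Years ago, while writing my master's thesis, I implemented this kind of comparison myself; only now, after starting work, did I discover that such a concise library exists, and I honestly don't know whether to laugh or cry.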


Usage

Comparing Similarity

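The core of the difflib module is the SequenceMatcher class, which directly compares two sequences. Its first argument, passed as None in the example below, is isjunk: a function that lets us define which elements or characters to ignore. These ignored elements are treated as "junk".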

import difflib

# Strings
str1 = "apple pie"
str2 = "apple pies"

# SequenceMatcher
matcher = difflib.SequenceMatcher(None, str1, str2)

# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")


Output:

similarity: 0.95
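
To make the isjunk argument more concrete, here is a minimal sketch adapted from the example in the Python documentation. The two input strings are illustrative and not taken from my own data; the lambda marks the space character as junk, so the matcher looks for matching blocks in the non-space content first and only uses spaces to extend those blocks.

```python
import difflib

a = "private Thread currentThread;"
b = "private volatile Thread currentThread;"

# Treat the space character as junk: it is skipped while searching for
# the longest matching blocks and only used to extend a match afterwards.
matcher = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)

junk_ratio = round(matcher.ratio(), 3)
blocks = [tuple(block) for block in matcher.get_matching_blocks()]

print(f"similarity: {junk_ratio}")
for a_start, b_start, size in blocks:
    # The final block always has size 0 and acts as an end marker.
    print(f"a[{a_start}:{a_start + size}] == b[{b_start}:{b_start + size}]")
```

ratio() is computed as 2.0 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences, which is why "apple pie" against "apple pies" gives 2 * 9 / 19, or about 0.95.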


Besides strings, it can also compare arrays (Python lists):

import difflib

# Arrays
arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]

# SequenceMatcher
matcher = difflib.SequenceMatcher(None, arr1, arr2)

# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")


Output:

similarity: 0.50
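
To see where the 0.50 comes from, the sketch below prints the edit operations that SequenceMatcher finds for the same two lists, using get_opcodes(). The print formatting is my own choice; the point is that only "abc" is matched, so the ratio is 2 * 1 / 4.

```python
import difflib

arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]

matcher = difflib.SequenceMatcher(None, arr1, arr2)

# Each opcode is (tag, i1, i2, j1, j2): it describes how arr1[i1:i2]
# relates to arr2[j1:j2] when turning arr1 into arr2.
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(f"{tag:7} arr1[{i1}:{i2}]={arr1[i1:i2]} arr2[{j1}:{j2}]={arr2[j1:j2]}")
```

Note that list elements are compared as whole items and must be hashable; "abc" and "bca" count as entirely different elements even though they share characters.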


Finding Differences

difflib also provides ndiff and unified_diff to show the specific differences between two texts.

import difflib

text1 = """Hello world
This is an example
Goodbye world"""

text2 = """Hello world
This is an example program
Goodbye world"""

diff = difflib.ndiff(text1.splitlines(), text2.splitlines())
print("\n".join(diff))


Output:

  Hello world
- This is an example
+ This is an example program
?                   ++++++++

  Goodbye world
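
Every line that ndiff yields starts with a two-character code, so the changed lines are easy to filter out. The following is a small sketch under that assumption; the variable names are mine.

```python
import difflib

old_lines = ["Hello world", "This is an example", "Goodbye world"]
new_lines = ["Hello world", "This is an example program", "Goodbye world"]

# ndiff prefixes every line with a two-character code:
#   "- " line only in the first sequence
#   "+ " line only in the second sequence
#   "  " line common to both
#   "? " hint line that is not part of either input
changes = [
    line
    for line in difflib.ndiff(old_lines, new_lines)
    if line.startswith(("- ", "+ "))
]
print("\n".join(changes))
```

The "? " hint lines end with a newline of their own, which is why a blank line follows the hint when the output is joined with "\n".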


unified_diff shows the same differences in another format:

import difflib

text1 = """Hello world
This is an example
Goodbye world"""

text2 = """Hello world
This is an example program
Goodbye world"""

diff = difflib.unified_diff(
    text1.splitlines(),
    text2.splitlines(), 
    fromfile="text1.txt",
    tofile="text2.txt",
)
print("\n".join(diff))


Output:

--- text1.txt

+++ text2.txt

@@ -1,3 +1,3 @@

 Hello world
-This is an example
+This is an example program
 Goodbye world
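
The blank lines after the ---, +++ and @@ lines appear because unified_diff ends those control lines with lineterm, which defaults to "\n", while the input lines here have no trailing newline. A minimal sketch of the usual remedy, passing lineterm="" so the output is uniformly free of newlines:

```python
import difflib

old_lines = ["Hello world", "This is an example", "Goodbye world"]
new_lines = ["Hello world", "This is an example program", "Goodbye world"]

# lineterm="" stops unified_diff from appending "\n" to the header and
# hunk lines, which suits inputs produced by str.splitlines().
diff_lines = list(
    difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="text1.txt",
        tofile="text2.txt",
        lineterm="",
    )
)
print("\n".join(diff_lines))
```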


Finding the Closest String

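What follows is the use case that actually led me to look into difflib to improve my code: finding the closest candidate string. When I use my LLM to extract scores from reports, hallucinations occur, producing a number that is extremely close to the real report score yet different from it. In that case, I can use get_close_matches() to find the closest correct answer.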

import difflib

word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]

# Find the close matches
matches = difflib.get_close_matches(word, word_list, n=3, cutoff=0.6)
print(matches)


Output:

['8.50', '0.550', '0.50']
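
To show why the candidates come back in this order, the sketch below scores each one with SequenceMatcher directly and then wraps the lookup in a small helper. The helper name snap_to_known is hypothetical and only for illustration; it is not part of difflib.

```python
import difflib

word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]

# Score each candidate the way get_close_matches() does internally:
# the candidate is the first sequence and the query is the second.
scores = {
    candidate: round(difflib.SequenceMatcher(None, candidate, word).ratio(), 3)
    for candidate in word_list
}
print(scores)


def snap_to_known(value, candidates, cutoff=0.6):
    """Hypothetical helper: return the closest candidate, or None if none passes the cutoff."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


best = snap_to_known(word, word_list)
print(best)
```

"9.32" scores about 0.22, well under the 0.6 cutoff, so it is dropped. Raising cutoff makes the matching stricter, and returning None when nothing passes leaves room to flag the value for manual review instead of forcing a match.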
