Last Updated on 2024-08-28 by Clay
difflib is a module in the Python standard library for comparing sequences (most often text). Back when I was working on my thesis, I implemented this kind of comparison by hand. It’s funny, and a bit frustrating, to discover at work that such a neat module already exists.
Usage
Compare Similarity
The core of difflib is the SequenceMatcher class, which compares two sequences directly. Its first parameter, isjunk (set to None here), is an optional function that marks elements or characters to ignore. These are usually considered ‘junk’ elements.
import difflib
# Strings
str1 = "apple pie"
str2 = "apple pies"
# SequenceMatcher
matcher = difflib.SequenceMatcher(None, str1, str2)
# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")
Output:
similarity: 0.95
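To see where the 0.95 comes from, it can help to look at what the matcher actually found. The sketch below assumes the documented definition of ratio(), which is 2.0 * M / T, where M is the number of matched elements and T is the combined length of both sequences. The variable names are only for illustration.

```python
import difflib

str1 = "apple pie"
str2 = "apple pies"
matcher = difflib.SequenceMatcher(None, str1, str2)

# Matching blocks are (a, b, size) triples; the last one is a zero-size sentinel
blocks = matcher.get_matching_blocks()
matched = sum(block.size for block in blocks)

# Recompute the ratio by hand: 2 * M / T
manual_ratio = 2.0 * matched / (len(str1) + len(str2))
print(f"matched: {matched}, manual ratio: {manual_ratio:.2f}")

# Opcodes describe how to turn str1 into str2
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:7} a[{i1}:{i2}]={str1[i1:i2]!r} b[{j1}:{j2}]={str2[j1:j2]!r}")
```

All nine characters of "apple pie" are matched, and the only other opcode is an insert for the trailing "s", so the hand-computed value is 2 * 9 / 19.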
It can also compare lists, as long as the elements are hashable:
import difflib
# Lists
arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]
# SequenceMatcher
matcher = difflib.SequenceMatcher(None, arr1, arr2)
# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")
Output:
similarity: 0.50
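Because any list of hashable elements works, one option is to split text into words first and compare at the word level instead of the character level. This is a minimal sketch; the sentences are made up for illustration.

```python
import difflib

# Hypothetical sentences, split into word lists
words1 = "the quick brown fox".split()
words2 = "the quick red fox".split()

matcher = difflib.SequenceMatcher(None, words1, words2)
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")
```

Three words match out of eight in total, so the result is 2 * 3 / 8 = 0.75.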
Find Differences
difflib also provides ndiff and unified_diff to compare the specific differences between two texts.
import difflib
text1 = """Hello world
This is an example
Goodbye world"""
text2 = """Hello world
This is an example program
Goodbye world"""
diff = difflib.ndiff(text1.splitlines(), text2.splitlines())
print("\n".join(diff))
Output:
  Hello world
- This is an example
+ This is an example program
?                   ++++++++

  Goodbye world
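Each line that ndiff yields starts with a two-character code: "- " for lines only in the first text, "+ " for lines only in the second, two spaces for lines present in both, and "? " for hint lines that point at the changed characters. If only the changes matter, one possible approach is to filter on those prefixes, as in this sketch (the variable names are just for illustration):

```python
import difflib

text1 = """Hello world
This is an example
Goodbye world"""
text2 = """Hello world
This is an example program
Goodbye world"""

diff = list(difflib.ndiff(text1.splitlines(), text2.splitlines()))

# Keep only removed and added lines, dropping unchanged lines and "? " hints
removed = [line[2:] for line in diff if line.startswith("- ")]
added = [line[2:] for line in diff if line.startswith("+ ")]

print("removed:", removed)
print("added:", added)
```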
The unified_diff format is another way to display differences:
import difflib
text1 = """Hello world
This is an example
Goodbye world"""
text2 = """Hello world
This is an example program
Goodbye world"""
diff = difflib.unified_diff(
    text1.splitlines(),
    text2.splitlines(),
    fromfile="text1.txt",
    tofile="text2.txt",
    lineterm="",  # the input lines have no trailing newline, so the header lines should not either
)
print("\n".join(diff))
Output:
--- text1.txt
+++ text2.txt
@@ -1,3 +1,3 @@
 Hello world
-This is an example
+This is an example program
 Goodbye world
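unified_diff also takes an n parameter for the number of context lines around each change (the default is 3). As a small sketch, setting n=0 keeps only the changed lines. lineterm="" is passed because the lines from splitlines() carry no trailing newline.

```python
import difflib

text1 = """Hello world
This is an example
Goodbye world"""
text2 = """Hello world
This is an example program
Goodbye world"""

diff = list(
    difflib.unified_diff(
        text1.splitlines(),
        text2.splitlines(),
        fromfile="text1.txt",
        tofile="text2.txt",
        n=0,          # no context lines around each change
        lineterm="",  # do not append "\n" to the header lines
    )
)
print("\n".join(diff))
```

With no context, the result is just the two file headers, one hunk header, and the removed and added lines.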
Find Closest Matches
Finally, here is the feature I actually used to improve my code: finding the closest candidate strings. When I extract report scores with an LLM, it sometimes hallucinates a number that is very close to the real score but not quite correct.
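In this case, I can use get_close_matches() to find the closest correct answer. Here n caps the number of results, and cutoff (between 0 and 1) is the minimum similarity a candidate needs; the results come back sorted with the most similar first.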
import difflib
word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]
# Find the close matches
matches = difflib.get_close_matches(word, word_list, n=3, cutoff=0.6)
print(matches)
Output:
['8.50', '0.550', '0.50']
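For the score-correction use case, this could be wrapped in a small helper. The sketch below is hypothetical: the function name snap_to_known and the 0.6 cutoff are assumptions for illustration, and the right cutoff depends on the data.

```python
import difflib
from typing import Optional, Sequence


def snap_to_known(value: str, known: Sequence[str], cutoff: float = 0.6) -> Optional[str]:
    """Return the known string most similar to value, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


word_list = ["8.50", "9.32", "0.50", "0.550"]
print(snap_to_known("8.550", word_list))
print(snap_to_known("abc", word_list))
```

Returning None when nothing clears the cutoff makes it possible to flag the value for manual review instead of silently accepting a poor match. Keep in mind that the comparison is purely textual, so it knows nothing about numeric distance.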