[Python] Using the difflib Module to Compare Sequence Differences

Last Updated on 2024-08-28 by Clay

difflib is a module in the Python standard library for comparing sequences, most often text. Back when I was writing my thesis, I implemented this kind of comparison by hand, so it is both funny and a little frustrating to discover at work that such a neat module already exists.


Usage

Compare Similarity

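The core of difflib is the SequenceMatcher class, which compares two sequences directly. Its first parameter, isjunk, is either None or a one-argument function that returns True for elements (characters, in the case of strings) that should be ignored as 'junk'. Passing None, as in the example here, means no element is treated as junk. The ratio() method then returns a similarity score between 0 and 1.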

import difflib

# Strings
str1 = "apple pie"
str2 = "apple pies"

# Create a SequenceMatcher
matcher = difflib.SequenceMatcher(None, str1, str2)

# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")


Output:

similarity: 0.95
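
To see what isjunk actually changes, here is a small sketch adapted from the example in the official difflib documentation. The strings are purely illustrative and not from my own use case, and the junk function simply treats a space as ignorable.

```python
import difflib

a = " abcd"
b = "abcd abcd"

# No junk function: the leading space in `a` can take part in the match
plain = difflib.SequenceMatcher(None, a, b)
plain_match = plain.find_longest_match(0, len(a), 0, len(b))
print(plain_match)  # Match(a=0, b=4, size=5)

# Treat spaces as junk: a match can no longer be anchored on a space
no_space = difflib.SequenceMatcher(lambda ch: ch == " ", a, b)
junk_match = no_space.find_longest_match(0, len(a), 0, len(b))
print(junk_match)  # Match(a=1, b=0, size=4)
```

Without a junk function the longest match is " abcd" (including the space). With spaces marked as junk, the matcher prefers the "abcd" at the start of the second string.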


It also works on lists (or any sequence whose elements are hashable):

import difflib

# Arrays
arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]

# Create a SequenceMatcher
matcher = difflib.SequenceMatcher(None, arr1, arr2)

# Get similarity
similarity = matcher.ratio()
print(f"similarity: {similarity:.2f}")


Output:

similarity: 0.50
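
The 0.50 above follows from how ratio() is defined: 2 * M / T, where M is the number of matched elements and T is the total number of elements in both sequences. A minimal sketch that recomputes it by hand from get_matching_blocks() (the variable names matched, total and manual_ratio are my own):

```python
import difflib

arr1 = ["abc", "bca"]
arr2 = ["apple pies", "abc"]
matcher = difflib.SequenceMatcher(None, arr1, arr2)

# Sum the sizes of all matching blocks (the last block is a zero-size sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())
total = len(arr1) + len(arr2)

manual_ratio = 2 * matched / total
print(matched, total, manual_ratio)  # 1 4 0.5
```

Only "abc" appears in both lists, so M = 1 and T = 4. Note that list elements are compared as a whole: "abc" and "bca" count as completely different here, even though they share the same characters.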


Find Differences

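difflib also provides ndiff() and unified_diff(), which report the specific line-by-line differences between two texts. Both expect sequences of lines, which is why the examples here call splitlines() first.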

import difflib

text1 = """Hello world
This is an example
Goodbye world"""

text2 = """Hello world
This is an example program
Goodbye world"""

diff = difflib.ndiff(text1.splitlines(), text2.splitlines())
print("\n".join(diff))


Output:

  Hello world
- This is an example
+ This is an example program
?                   ++++++++

  Goodbye world
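
Each ndiff line starts with a two-character code: two spaces for a line common to both texts, "- " for a line only in the first, "+ " for a line only in the second, and "? " for a hint line that appears in neither input. As a sketch of how those prefixes can be used, the following keeps only the changed lines and then rebuilds the first text with difflib.restore():

```python
import difflib

text1 = ["Hello world", "This is an example", "Goodbye world"]
text2 = ["Hello world", "This is an example program", "Goodbye world"]

# ndiff returns a generator, so store it in a list to reuse it
diff = list(difflib.ndiff(text1, text2))

# Keep only removed ("- ") and added ("+ ") lines
changed = [line for line in diff if line.startswith(("- ", "+ "))]
print(changed)
# ['- This is an example', '+ This is an example program']

# Rebuild the first input (1) from the diff; pass 2 for the second input
original = list(difflib.restore(diff, 1))
print(original == text1)  # True
```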


The unified_diff format is another way to display differences:

import difflib

text1 = """Hello world
This is an example
Goodbye world"""

text2 = """Hello world
This is an example program
Goodbye world"""

diff = difflib.unified_diff(
    text1.splitlines(),
    text2.splitlines(), 
    fromfile="text1.txt",
    tofile="text2.txt",
)
print("\n".join(diff))


Output:

--- text1.txt

+++ text2.txt

@@ -1,3 +1,3 @@

 Hello world
-This is an example
+This is an example program
 Goodbye world
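
The blank lines after the three header lines in the output above appear because unified_diff() ends its control lines with a newline by default (the lineterm parameter), while the input lines here have no trailing newlines. A minimal sketch, assuming the same two texts, that passes lineterm="" to get a compact result:

```python
import difflib

text1 = ["Hello world", "This is an example", "Goodbye world"]
text2 = ["Hello world", "This is an example program", "Goodbye world"]

# lineterm="" stops unified_diff from appending "\n" to its header lines
diff = list(difflib.unified_diff(
    text1,
    text2,
    fromfile="text1.txt",
    tofile="text2.txt",
    lineterm="",
))
print("\n".join(diff))
# --- text1.txt
# +++ text2.txt
# @@ -1,3 +1,3 @@
#  Hello world
# -This is an example
# +This is an example program
#  Goodbye world
```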


Find Closest Matches

Finally, here is the feature I actually used to improve my code: finding the closest candidate strings. When I extract report scores with an LLM, it sometimes hallucinates a number that is very close to the real report score but not quite correct.

In this case, I can use get_close_matches() to find the closest correct answer. The n parameter is the maximum number of matches to return, and cutoff is the minimum similarity (between 0 and 1) a candidate needs in order to be included.

import difflib

word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]

# Find the close matches
matches = difflib.get_close_matches(word, word_list, n=3, cutoff=0.6)
print(matches)


Output:

['8.50', '0.550', '0.50']
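
get_close_matches() ranks candidates by the same SequenceMatcher ratio used earlier, so it can help to print the score of each candidate. The sketch below does that and wraps the lookup in a small helper. The name snap_to_known is hypothetical and my own, not part of difflib.

```python
import difflib

word = "8.550"
word_list = ["8.50", "9.32", "0.50", "0.550"]

# Score every candidate against the LLM output
scores = {
    candidate: difflib.SequenceMatcher(None, candidate, word).ratio()
    for candidate in word_list
}
for candidate, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{candidate}: {score:.2f}")


def snap_to_known(value, candidates, cutoff=0.6):
    """Hypothetical helper: return the best match, or None if nothing passes cutoff."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(snap_to_known("8.550", word_list))  # 8.50
print(snap_to_known("xyz", word_list))    # None
```

Keep in mind that the ranking is based on shared characters, not numeric distance: "0.550" is considered a close match to "8.550" even though the two numbers are far apart. If the candidates can differ widely in magnitude, a numeric comparison may be a better fit.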
