Skip to content

[NLP] Use Python to Calculate the Minimum Edit Distance of Two Sentences

When we processing natural language processing task, sometimes we may need to determine the similarity of two sentences.

Of course, it is also possible that you want to determine the similarity between texts, not just sentences.

There are many popular methods, such as using word vectors, sentence vectors, and TF-IDF to calculate similarity, but you can also use the classic Minimum Edit Distance (MED) to calculate, and the calculation speed is also very fast.

This article records the formula for the minimum edit distance and how to use python to implement such a calculation method.


What Is "Minimum Edit Distance" ?

First, let's read the explanation of Wiki:

In computational linguistics and computer scienceedit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other.
—— From Wiki

In computational linguistics and computer scienceedit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question. In bioinformatics, it can be used to quantify the similarity of DNA sequences, which can be viewed as strings of the letters A, C, G and T.
Different definitions of an edit distance use different sets of string operations. Levenshtein distance operations are the removal, insertion, or substitution of a character in the string. Being the most common metric, the term Levenshtein distance is often used interchangeably with edit distance.[1]

Different types of edit distance allow different sets of string operations. For instance:
- The Levenshtein distance allows deletion, insertion and substitution.
- The Longest common subsequence (LCS) distance allows only insertion and deletion, not substitution.
- The Hamming distance allows only substitution, hence, it only applies to strings of the same length.
- The Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters.

The Jaro distance allows only transposition.
Some edit distances are defined as a parameterizable metric calculated with a specific set of allowed edit operations, and each operation is assigned a cost (possibly infinite). This is further generalized by DNA sequence alignment algorithms such as the Smith–Waterman algorithm, which make an operation's cost depend on where it is applied.


For exmaple
The Levenshtein distance between "kitten" and "sitting" is 3. A minimal edit script that transforms the former into the latter is:
kitten → sitten (substitute "s" for "k")
sitten → sittin (substitute "i" for "e")
sittin → sitting (insert "g" at the end)

LCS distance (insertions and deletions only) gives a different distance and minimal edit script:
kitten → itten (delete "k" at 0)
itten → sitten (insert "s" at 0)
sitten → sittn (delete "e" at 4)
sittn → sittin (insert "i" at 4)
sittin → sitting (insert "g" at 6)

Simply put, the so-called minimum edit distance refers to the minimum steps of how to completely replace one sentence with another sentence through the use of "substitute", "insert" and "delete".

This is the minimum edit distance between two sentences.


Algorithm

First, let's look at the pseudocode:


Actually I think it's quite troublesome...haha, I originally program a version, although it can run, but it seems that there is no correct solution of the standard answer on the online problem-solving website, so I finally tried to remake my code.

def MED(sent_01, sent_02):
    n = len(sent_01)
    m = len(sent_02)

    matrix = [[i+j for j in range(m+1)] for i in range(n+1)]

    for i in range(1, n+1):
        for j in range(1, m+1):
            if sent_01[i-1] == sent_02[j-1]:
                d = 0
            else:
                d = 1

            matrix[i][j] = min(matrix[i-1][j]+1, matrix[i][j-1]+1, matrix[i-1][j-1]+d)

    distance_score = matrix[n][m]
   
    return distance_score



Reference


Read More

Tags:

Leave a ReplyCancel reply

Exit mobile version