Archive

Posts Tagged ‘distance’

string distances

August 17, 2016 Leave a comment

See the Jellyfish project: “Jellyfish is a python library for doing approximate and phonetic matching of strings“.

Jellyfish implements the following algorithms: Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance.

See the project page for more info.

Categories: python Tags: ,

Levenshtein distance

October 19, 2010 Leave a comment

The Levenshtein distance (or edit distance) between two strings is the minimal number of “edit operations” required to change one string into the other. The two strings can have different lengths. There are three kinds of “edit operations”: deletion, insertion, or alteration of a character in either string.

Example: the Levenshtein distance of “ag-tcc” and “cgctca” is 3.

#!/usr/bin/env python

def LD(s,t):
    s = ' ' + s
    t = ' ' + t
    d = {}
    S = len(s)
    T = len(t)
    for i in range(S):
        d[i, 0] = i
    for j in range (T):
        d[0, j] = j
    for j in range(1,T):
        for i in range(1,S):
            if s[i] == t[j]:
                d[i, j] = d[i-1, j-1]
            else:
                d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + 1)
    return d[S-1, T-1]

a = 'ag-tcc'
b = 'cgctca'

print LD(a, b)   # 3

The implementation is from here.

Categories: python Tags: ,

Hamming distance

October 19, 2010 Leave a comment

The Hamming distance is defined between two strings of equal length. It measures the number of positions with mismatching characters.

Example: the Hamming distance between “toned” and “roses” is 3.

#!/usr/bin/env python

def hamming_distance(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

if __name__=="__main__":
    a = 'toned'
    b = 'roses'
    print hamming_distance(a, b)   # 3

If you need the number of matching character positions:

#!/usr/bin/env python

def similarity(s1, s2):
    assert len(s1) == len(s2)
    return sum(ch1 == ch2 for ch1, ch2 in zip(s1, s2))

if __name__=="__main__":
    a = 'toned'
    b = 'roses'
    print similarity(a, b)    # 2

Actually this is equal to len(s1) - hamming_distance(s1, s2). Remember, len(s1) == len(s2).

More info on zip() here.

Categories: python Tags: , , ,