## string distances

See the Jellyfish project: “*Jellyfish is a python library for doing approximate and phonetic matching of strings*“.

Jellyfish implements the following algorithms: Levenshtein Distance, Damerau-Levenshtein Distance, Jaro Distance, Jaro-Winkler Distance, Match Rating Approach Comparison, Hamming Distance.

See the project page for more info.

## Levenshtein distance

The Levenshtein distance (or edit distance) between two strings is the minimal number of “edit operations” required to change one string into the other. The two strings can have different lengths. There are three kinds of “edit operations”: deletion, insertion, or alteration of a character in either string.

Example: the Levenshtein distance of “ag-tcc” and “cgctca” is 3.

#!/usr/bin/env python def LD(s,t): s = ' ' + s t = ' ' + t d = {} S = len(s) T = len(t) for i in range(S): d[i, 0] = i for j in range (T): d[0, j] = j for j in range(1,T): for i in range(1,S): if s[i] == t[j]: d[i, j] = d[i-1, j-1] else: d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + 1) return d[S-1, T-1] a = 'ag-tcc' b = 'cgctca' print LD(a, b) # 3

The implementation is from here.

## Hamming distance

The Hamming distance is defined between two strings of equal length. It measures the number of positions with mismatching characters.

Example: the Hamming distance between “**t**o**n**e**d**” and “**r**o**s**e**s**” is 3.

#!/usr/bin/env python def hamming_distance(s1, s2): assert len(s1) == len(s2) return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2)) if __name__=="__main__": a = 'toned' b = 'roses' print hamming_distance(a, b) # 3

If you need the number of matching character positions:

#!/usr/bin/env python def similarity(s1, s2): assert len(s1) == len(s2) return sum(ch1 == ch2 for ch1, ch2 in zip(s1, s2)) if __name__=="__main__": a = 'toned' b = 'roses' print similarity(a, b) # 2

Actually this is equal to `len(s1) - hamming_distance(s1, s2)`

. Remember, `len(s1) == len(s2)`

.

More info on `zip()`

here.