
String Distance Matrix in Python using pdist

How to calculate Jaro Winkler distance matrix of strings in Python?

I have a large array of hand-entered strings (names and record numbers), and I'm trying to find duplicates in the list, including duplicates with slight variations in spelling. A response to a similar question suggested using SciPy's pdist function with a custom distance function. I've tried to implement this with the jaro_winkler function from the Levenshtein package. The problem is that jaro_winkler requires string inputs, whereas pdist seems to require a 2D array.

Example:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from Levenshtein import jaro_winkler

fname = np.array(['Bob','Carl','Kristen','Calr', 'Doug']).reshape(-1,1)
dm = pdist(fname, jaro_winkler)
dm = squareform(dm)

Expected Output - Something like this:

          Bob  Carl   Kristen  Calr  Doug
Bob       1.0   -        -       -     -
Carl      0.0   1.0      -       -     -
Kristen   0.0   0.46    1.0      -     -
Calr      0.0   0.93    0.46    1.0    -
Doug      0.53  0.0     0.0     0.0   1.0

Actual Error:

jaro_winkler expected two Strings or two Unicodes

I'm assuming this is because the jaro_winkler function is seeing an ndarray instead of a string, and I'm not sure how to convert the function input to a string in the context of the pdist function.

Does anyone have a suggestion to allow this to work? Thanks in advance!

Mark W asked Sep 27 '17 16:09



1 Answer

You need to wrap the distance function so that it unpacks the string from each one-element row that pdist passes in, as demonstrated in the following example with the Levenshtein distance:

import numpy as np    
from Levenshtein import distance
from scipy.spatial.distance import pdist, squareform

# my list of strings
strings = ["hello","hallo","choco"]

# pdist expects a 2D array of shape (M, N): here M = 3 entries, each with N = 1 dimension
transformed_strings = np.array(strings).reshape(-1, 1)

# calculate the condensed distance matrix by wrapping the Levenshtein distance function;
# x and y are one-element rows, so x[0] and y[0] are the strings themselves
distance_matrix = pdist(transformed_strings, lambda x, y: distance(x[0], y[0]))

# get square matrix
print(squareform(distance_matrix))

Output:
array([[ 0.,  1.,  4.],
       [ 1.,  0.,  4.],
       [ 4.,  4.,  0.]])
Rick answered Sep 29 '22 20:09