
python numpy pairwise edit-distance

I have a NumPy array of strings, and I want to calculate the edit distance between every pair of elements using scipy.spatial.distance.pdist (http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html).

A sample of my array is as follows:

 >>> d[0:10]
 array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
   'GATTT', 'TCTTT', 'ACTTT'], 
  dtype='|S5')

However, pdist has no built-in edit-distance metric, so I want to supply a custom distance function. I tried the following and got this error:

 >>> import editdist
 >>> import scipy
 >>> import scipy.spatial
 >>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT
Asked by Vahid Mirjalili, Nov 23 '22 19:11



1 Answer

If you really must use pdist, you first need to convert your strings to a numeric format, because pdist expects a 2-D numeric array. If you know that all strings have the same length, you can do this rather easily:

numeric_d = d.view(np.uint8).reshape((len(d),-1))

This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:

In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)
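As a minimal sketch of that conversion, assuming a fixed-width byte-string array like the `|S5` one in the question (the sample values are copied from it), you can check the view by round-tripping each row back to its original string:

```python
import numpy as np

# Sample data copied from the question; dtype 'S5' means 5-byte strings.
d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'GTTTT', b'TATTT'], dtype='S5')

# Reinterpret the underlying buffer as bytes, one row per original string.
numeric_d = d.view(np.uint8).reshape((len(d), -1))

# Each row holds the ASCII codes of one string (84 = 'T', 65 = 'A', ...).
first_row = numeric_d[0].tolist()

# Converting a row back to bytes recovers the original string.
round_trip = [row.tobytes() for row in numeric_d]
```

Note that this relies on a bytes dtype. With a unicode (`<U5`) array, which Python 3 creates by default from `str` values, each character occupies four bytes, so one option is to convert with `d.astype('S')` first, assuming the data is ASCII-only.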

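Then you can use pdist as you normally would. Just make sure that your editdist function expects arrays of numbers rather than strings. Depending on the SciPy version, pdist may convert the input to doubles before calling your function (the traceback in the question shows this step), so cast each row back to uint8 before turning it into bytes with .tobytes() (.tostring() is a deprecated alias in newer NumPy releases):

def editdist(x, y):
  # Recover the original byte strings from the numeric rows
  s1 = x.astype(np.uint8).tobytes()
  s2 = y.astype(np.uint8).tobytes()
  ... rest of function as before ...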
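Putting the pieces together, here is a self-contained sketch rather than a definitive implementation. The `levenshtein` helper is a hypothetical pure-Python stand-in for the `editdist.distance` call in the question (that package may not be installed), and `row_edit_distance` is an illustrative name, not part of SciPy. Rows are cast back to `uint8` before decoding, on the assumption that some SciPy versions pass them to a custom metric as doubles.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]


def row_edit_distance(u, v):
    """Metric for pdist: decode numeric rows back to bytes, then compare."""
    s1 = np.asarray(u).astype(np.uint8).tobytes()
    s2 = np.asarray(v).astype(np.uint8).tobytes()
    return levenshtein(s1, s2)


# Sample values taken from the question's array.
d = np.array([b'TTTTT', b'ATTTT', b'CTTTT', b'TATTT', b'AATTT'], dtype='S5')
numeric_d = d.view(np.uint8).reshape((len(d), -1))

condensed = pdist(numeric_d, row_edit_distance)  # n*(n-1)/2 pairwise values
square = squareform(condensed)                   # full symmetric n x n matrix
```

A Python-level metric is called once per pair, so this scales quadratically with the number of strings and is much slower than pdist's built-in metrics; for large arrays a compiled edit-distance routine is likely worth the extra dependency.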
Answered by perimosocordiae, Nov 25 '22 07:11
