
Distance matrix creation using nparray with pdist and squareform

I'm trying to cluster location data using DBSCAN (the scikit-learn implementation). My data is in a NumPy array, but to use DBSCAN with the haversine formula I need to create a distance matrix. When I try to do this I get a "'module' object is not callable" error. From what I've read online this is usually an import error, but I'm pretty sure that's not the case for me. I've written my own haversine distance function, but I'm sure the error is not in it.

This is my input data, an np array (ResultArray).

[[ 53.3252628   -6.2644198 ]
[ 53.3287395   -6.2646543 ]
[ 53.33321202  -6.24785807]
[ 53.3261015   -6.2598324 ]
[ 53.325291    -6.2644105 ]
[ 53.3281323   -6.2661467 ]
[ 53.3253074   -6.2644483 ]
[ 53.3388147   -6.2338417 ]
[ 53.3381102   -6.2343826 ]
[ 53.3253074   -6.2644483 ]
[ 53.3228188   -6.2625379 ]
[ 53.3253074   -6.2644483 ]]

And this is the line of code that raises the error.

distance_matrix = sp.spatial.distance.squareform(sp.spatial.distance.pdist
(ResultArray,(lambda u,v: haversine(u,v))))

This is the error message:

File "Location.py", line 48, in <module>
distance_matrix = sp.spatial.distance.squareform(sp.spatial.distance.pdist
(ResArray,(lambda u,v: haversine(u,v))))
File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1118, in pdist
dm[k] = dfun(X[i], X[j])
File "Location.py", line 48, in <lambda>
distance_matrix = sp.spatial.distance.squareform(sp.spatial.distance.pdist
(ResArray,(lambda u,v: haversine(u,v))))
TypeError: 'module' object is not callable

I import SciPy with: import scipy as sp

TheBaywatchKid asked Feb 27 '14 22:02


2 Answers

With SciPy you can pass a custom distance function to pdist, as described in the documentation at this link and quoted here for convenience:

Y = pdist(X, f)
Computes the distance between all pairs of vectors in X using the user supplied 2-arity function f. For example, Euclidean distance between the vectors could be computed as follows:

dm = pdist(X, lambda u, v: np.sqrt(((u-v)**2).sum()))

Here is my version of the code, inspired by the code at this link:

from numpy import sin,cos,arctan2,sqrt,pi # import from numpy
# earth's mean radius = 6,371km
EARTHRADIUS = 6371.0

def getDistanceByHaversine(loc1, loc2):
    '''Haversine formula - give each location as a
    (lon_decimal, lat_decimal) pair in decimal degrees'''
    #
    # "unpack" each location: index 0 is longitude, index 1 is latitude
    lat1 = loc1[1]
    lon1 = loc1[0]
    lat2 = loc2[1]
    lon2 = loc2[0]
    #
    # convert to radians ##### Completely identical
    lon1 = lon1 * pi / 180.0
    lon2 = lon2 * pi / 180.0
    lat1 = lat1 * pi / 180.0
    lat2 = lat2 * pi / 180.0
    #
    # haversine formula #### Same, but atan2 named arctan2 in numpy
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2.0))**2
    c = 2.0 * arctan2(sqrt(a), sqrt(1.0-a))
    km = EARTHRADIUS * c
    return km

Call it in the following way (after from scipy import spatial):

D = spatial.distance.pdist(A, lambda u, v: getDistanceByHaversine(u,v))

In my implementation, the first column of the matrix A holds the longitude values and the second column the latitude values, both in decimal degrees.
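For the question's own data, which is ordered (latitude, longitude), the whole pipeline can be sketched as below. This is a minimal sketch, not the asker's code: the names haversine_km, points and EARTH_RADIUS_KM are made up for illustration, and only the first four rows of ResultArray are used. Note that the traceback in the question fails inside the lambda, which suggests that the name haversine was bound to a module rather than a function (for instance through an import haversine statement); that is an assumption, since the question does not show its imports.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, same value as in the answer above


def haversine_km(u, v):
    """Great-circle distance in km between two (lat, lon) rows in decimal degrees."""
    lat1, lon1 = np.radians(u)
    lat2, lon2 = np.radians(v)
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# First four rows of the question's ResultArray: (latitude, longitude)
points = np.array([
    [53.3252628, -6.2644198],
    [53.3287395, -6.2646543],
    [53.33321202, -6.24785807],
    [53.3261015, -6.2598324],
])

# pdist calls the function once per pair of rows and returns a 1-D
# "condensed" array holding the N*(N-1)/2 pairwise distances.
condensed = pdist(points, haversine_km)

# squareform expands the condensed array into a symmetric N x N matrix
# with zeros on the diagonal.
distance_matrix = squareform(condensed)
```

The function is passed to pdist directly, so no lambda wrapper is needed. The resulting square matrix is the kind of input scikit-learn's DBSCAN accepts when it is constructed with metric='precomputed'; eps is then expressed in the same unit as the matrix (kilometres here).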

TommasoF answered Nov 15 '22 03:11


Please refer to @TommasoF's answer. This answer is wrong: pdist does allow a custom distance function. I will delete this answer once it is no longer marked as the accepted answer.

Simply put, SciPy's pdist does not allow you to pass in a custom distance function. As you can read in the docs, you have some options, but the haversine distance is not in the list of supported metrics.

(Matlab pdist does support the option though, see here)

You need to do the calculation "manually", i.e. with loops. Something like this will work:

from numpy import array,zeros

def haversine(lon1, lat1, lon2, lat2):
    """  See the link below for a possible implementation """
    pass

# example input (yours, truncated)
ResultArray = array([[ 53.3252628, -6.2644198 ],
                     [ 53.3287395  , -6.2646543 ],
                     [ 53.33321202 , -6.24785807],
                     [ 53.3253074  , -6.2644483 ]])

N = ResultArray.shape[0]
distance_matrix = zeros((N, N))
for i in range(N):
    # the matrix is symmetric, so compute the upper triangle and mirror it
    for j in range(i + 1, N):
        lati, loni = ResultArray[i]
        latj, lonj = ResultArray[j]
        distance_matrix[i, j] = haversine(loni, lati, lonj, latj)
        distance_matrix[j, i] = distance_matrix[i, j]

print(distance_matrix)
[[ 0.          0.38666203  1.41010971  0.00530489]
 [ 0.38666203  0.          1.22043364  0.38163748]
 [ 1.41010971  1.22043364  0.          1.40848782]
 [ 0.00530489  0.38163748  1.40848782  0.        ]]

Just for reference, a Python implementation of the haversine formula can be found here.
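One way the haversine(lon1, lat1, lon2, lat2) stub above could be filled in is sketched here. It is not the linked implementation; the constant EARTH_RADIUS_KM and its value of 6371 km are assumptions, and the printed matrix above may differ slightly depending on the radius that was used to produce it.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius in km


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    # convert all four angles from degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    # haversine formula
    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(a))
```

The argument order (longitude first, then latitude) matches the call inside the loop above.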

gg349 answered Nov 15 '22 04:11