
scipy pdist() on a pandas DataFrame

I have a large dataframe (e.g. 15k objects), where each row is an object and the columns are the numeric object features. It is in the form:

import pandas as pd

df = pd.DataFrame({ 'A' : [0, 0, 1],
                    'B' : [2, 3, 4],
                    'C' : [5, 0, 1],
                    'D' : [1, 1, 0]},
                    columns=['A', 'B', 'C', 'D'], index=['first', 'second', 'third'])

I want to calculate the pairwise distances between all objects (rows), and I have read that scipy's pdist() function is a good solution due to its computational efficiency. I can simply call:

from scipy.spatial.distance import pdist

res = pdist(df, 'cityblock')
res
>> array([ 6.,  8.,  4.])

And see that the res array contains the distances in the following order: [first-second, first-third, second-third].
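The ordering described above can be checked with a minimal sketch, assuming the rows are paired in the same order that itertools.combinations() yields the index labels. The name `labelled` is only illustrative, and the frame is rebuilt here so the snippet runs on its own:

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame(
    {"A": [0, 0, 1], "B": [2, 3, 4], "C": [5, 0, 1], "D": [1, 1, 0]},
    index=["first", "second", "third"],
)

# pdist returns the condensed form: one value per unordered row pair (i < j).
res = pdist(df.to_numpy(), "cityblock")

# combinations() walks the row labels in the same (i < j) order,
# so zipping the two attaches a label pair to every distance.
pairs = list(combinations(df.index, 2))
labelled = dict(zip(pairs, res))  # hypothetical name, for illustration only

for pair, distance in labelled.items():
    print(pair, distance)
```

This gives the dict form mentioned in the question. It holds one entry per pair, so it stores each distance only once.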

My question is: how can I get this into a matrix, DataFrame or (less desirably) dict format, so that I know exactly which pair each distance value belongs to, like below:

       first second third
first    0      -     -
second   6      0     -
third    8      4     0

Eventually, I think having the distance matrix as a pandas DataFrame may be convenient, since I may apply some ranking and ordering operations per row (e.g. find the top N closest objects to object first).

asked Oct 05 '15 10:10 by Zhubarb



1 Answer

Oh, I found the answer on this webpage. There is a dedicated function for this named squareform(), which converts the condensed distance vector returned by pdist() into a square, symmetric matrix. I am not deleting my question for the time being, in case it is helpful for someone else.

import pandas as pd
from scipy.spatial.distance import pdist, squareform

res = pdist(df, 'cityblock')   # condensed distance vector
squareform(res)                # square, symmetric NumPy array
pd.DataFrame(squareform(res), index=df.index, columns=df.index)
>>         first  second  third
>> first     0.0     6.0    8.0
>> second    6.0     0.0    4.0
>> third     8.0     4.0    0.0
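For the ranking use case mentioned in the question (top N closest objects to a given object), one possible sketch follows. The helper name `closest` is hypothetical and not part of pandas or scipy. It assumes the index labels are unique, so that dropping the object's own label removes only its zero self-distance.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame(
    {"A": [0, 0, 1], "B": [2, 3, 4], "C": [5, 0, 1], "D": [1, 1, 0]},
    index=["first", "second", "third"],
)

# Square distance matrix labelled with the row index on both axes.
dist = pd.DataFrame(
    squareform(pdist(df.to_numpy(), "cityblock")),
    index=df.index,
    columns=df.index,
)


def closest(dist, label, n):
    """Return the n smallest distances from `label` to the other rows."""
    # Drop the object's own entry (distance 0) before ranking.
    return dist.loc[label].drop(label).nsmallest(n)


print(closest(dist, "first", 2))
```

Note that the square matrix stores every distance twice. For 15k rows of float64 that is 15,000 x 15,000 x 8 bytes, roughly 1.8 GB, so it may be worth checking available memory before calling squareform() on the full data.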
answered Sep 18 '22 13:09 by Zhubarb