
Python- Selecting pairs of objects from a data frame

Tags:

python

pandas

I have a data frame that contains information about the positions of various objects, and a unique index for each object (index in this case is not related to the data frame). Here is some example data:

                     ind    pos
   x    y    z      
-1.0    7.0  0.0      21    [-2.76788330078, 217.786453247, 26.6822681427]
             0.0      22    [-7.23852539062, 217.274139404, 26.6758270264]
        0.0  1.0      152   [-0.868591308594, 2.48404550552, 48.4036369324]
        6.0  2.0      427   [-0.304443359375, 182.772140503, 79.4475860596]

The actual data frame is quite long. I have written a function that takes two vectors as inputs and outputs the distance between them:

import numpy as N

def dist(a, b):
    diff = N.array(a) - N.array(b)
    d = N.sqrt(N.dot(diff, diff))
    return d
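As a side note, NumPy ships this same computation as `np.linalg.norm`, which can replace the hand-rolled `dist`:

```python
import numpy as np

# positions of objects 21 and 22 from the example data
a = [-2.76788330078, 217.786453247, 26.6822681427]
b = [-7.23852539062, 217.274139404, 26.6758270264]

# Euclidean distance via the built-in norm
d = np.linalg.norm(np.array(a) - np.array(b))
print(round(d, 6))  # matches the (21, 22) distance in the answer below: ~4.499905
```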

and a function that, given two arrays, will output all the unique combinations of elements between these arrays:

def getPairs(a, b):
    if N.array_equal(a, b):
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(i+1, len(b))]
    else:
        pairs = [(a[i], b[j]) for i in range(len(a)) for j in range(len(b))]
    return pairs
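For what it's worth, the standard library's `itertools` already covers both branches of `getPairs`: `combinations` yields the unique pairs within one array, and `product` yields all cross pairs between two arrays:

```python
from itertools import combinations, product

a = [21, 22, 152, 427]

# unique unordered pairs within one array (same as getPairs(a, a))
self_pairs = list(combinations(a, 2))
print(len(self_pairs))  # 6 pairs from 4 elements

# all cross pairs between two different arrays (same as the else branch)
cross_pairs = list(product([21, 22], [152, 427]))
print(cross_pairs)  # [(21, 152), (21, 427), (22, 152), (22, 427)]
```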

I want to take my data frame and find all pairs of elements whose separation is less than some value, say 30. For the pairs that meet this requirement, I also need to store the calculated distance in another data frame. Here is my attempt at solving this, but it turned out to be extremely slow.

pairs = [getPairs(list(group.ind), list(boxes.get_group((name[0]+i, name[1]+j, name[2]+k)).ind))
         for i in [0, 1] for j in [0, 1] for k in [0, 1]
         if name[0]+i != 34 and name[1]+j != 34 and name[2]+k != 34]

pairs = list(itertools.chain(*pairs))

subInfo = pandas.DataFrame()
subInfo['pairs'] = pairs

subInfo['r'] = subInfo.pairs.apply(lambda x: dist(df_yz.query('ind == @x[0]').pos[0], df_yz.query('ind == @x[1]').pos[0]))

Don't worry about what I'm iterating over in this loop; it works for the system I'm dealing with and isn't where I'm getting slowed down. The step where I use .query() is where the major jam happens.
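One likely reason for that jam: `.query()` re-parses its expression and scans the frame on every call, so doing it once per pair is quadratic in practice. Building a plain dict from `ind` to `pos` once makes each lookup O(1). A minimal sketch, assuming `df_yz` has `ind` and `pos` columns (the frame below is made up from the example data):

```python
import pandas as pd

df_yz = pd.DataFrame({
    'ind': [21, 22],
    'pos': [[-2.76788330078, 217.786453247, 26.6822681427],
            [-7.23852539062, 217.274139404, 26.6758270264]],
})

# build the lookup once instead of calling .query() per pair
pos_lookup = dict(zip(df_yz['ind'], df_yz['pos']))
print(pos_lookup[21][0])  # first coordinate of object 21
```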

The output I am looking for is something like:

pair          distance
(21, 22)      22.59
(21, 152)     15.01
(22, 427)     19.22

I made the distances up, and the pair list would be much longer, but that's the basic idea.

Asked by Andrew King on Mar 08 '26

1 Answer

It took me a while, but here are three possible solutions. I hope they are self-explanatory. Written in Python 3.x in a Jupyter Notebook. One remark: if your coordinates are world coordinates, you may want to use the Haversine (great-circle) distance instead of the Euclidean distance, which is a straight line.

First, create your data

import pandas as pd
import numpy as np

values = [
    { 'x':-1.0, 'y':7.0, 'z':0.0, 'ind':21, 'pos':[-2.76788330078, 217.786453247, 26.6822681427] },
    { 'z':0.0, 'ind':22, 'pos':[-7.23852539062, 217.274139404, 26.6758270264] },
    { 'y':0.0, 'z':1.0, 'ind':152, 'pos':[-0.868591308594, 2.48404550552, 48.4036369324] },
    { 'y':6.0, 'z':2.0, 'ind':427, 'pos':[-0.304443359375, 182.772140503, 79.4475860596] }
]

def dist(a, b):
    """
    Calculates the Euclidean distance between two 3D-vectors.
    """
    diff = np.array(a) - np.array(b)    
    d = np.sqrt(np.dot(diff, diff))
    return d


df_initial = pd.DataFrame(values)

The following three solutions will generate this output:

    pairs   distance
1   (21, 22)    4.499905
3   (21, 427)   63.373886
7   (22, 427)   63.429709

The first solution is based on a full join of the data with itself. The downside is that it may exceed your memory if the dataset is huge. The advantages are the readability of the code and that it uses Pandas only:

#%%time 

df = df_initial.copy()

# join data with itself, each line will contain two geo-positions
df['tmp'] = 1
df = df.merge(df, on='tmp', suffixes=['1', '2']).drop('tmp', axis=1)

# remove rows with similar index
df = df[df['ind1'] != df['ind2']]

# calculate distance for all
df['distance'] = df.apply(lambda row: dist(row['pos1'], row['pos2']), axis=1)

# filter only those within a specific distance
df = df[df['distance'] < 70]

# combine original indices into a tuple
df['pairs'] = list(zip(df['ind1'], df['ind2']))

# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)

# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)

# print result
df
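A usage note on the dummy `tmp` column above: pandas 1.2 and newer support `how='cross'` in `merge`, which produces the same full join without the helper column. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'ind': [21, 22, 152]})

# cross join without a dummy column (requires pandas >= 1.2)
crossed = df.merge(df, how='cross', suffixes=['1', '2'])
print(len(crossed))  # 3 * 3 = 9 rows, columns 'ind1' and 'ind2'
```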

The second solution tries to avoid the memory issue of the first version by iterating over the original data line by line, calculating the distance between the current line and all original rows, and keeping only the values that satisfy the maximum distance constraint. I was expecting bad performance, but it wasn't bad at all (see the summary at the end).

#%%time

df = df_initial.copy()

results = list()
for index, row1 in df.iterrows():
    # calculate distance between current coordinate and all original rows in the data
    df['distance'] = df.apply(lambda row2: dist(row1['pos'], row2['pos']), axis=1)

    # filter only those within a specific distance and drop rows with same index as current coordinate
    df_tmp = df[(df['distance'] < 70) & (df['ind'] != row1['ind'])].copy()

    # prepare final data
    df_tmp['ind2'] = row1['ind']
    df_tmp['pairs'] = list(zip(df_tmp['ind'], df_tmp['ind2']))

    # remember data
    results.append(df_tmp)

# combine all into one dataframe
df = pd.concat(results)

# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)

# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)

# print result
df

The third solution is based on spatial operations using the KDTree from Scipy.

#%%time
from scipy import spatial

tree = spatial.KDTree(list(df_initial['pos']))

# calculate distances (returns a sparse matrix)
distances = tree.sparse_distance_matrix(tree, max_distance=70)

# convert the sparse matrix to COOrdinate (coo) format for easy access
# to the row indices, column indices and distance values
coo = distances.tocoo(copy=False)

def get_cell_value(idx: int, column: str = 'ind'):
    return df_initial.iloc[idx][column]

def extract_indices(row):
    distance, idx1, idx2 = row
    return get_cell_value(int(idx1)), get_cell_value(int(idx2))

df = pd.DataFrame({'idx1': coo.row, 'idx2': coo.col, 'distance': coo.data})
df['pairs'] = df.apply(extract_indices, axis=1)

# select columns of interest
df = df[['pairs', 'distance']]

def sort_tuple(idx):
    x, y = idx
    if y < x:
        return y, x
    return x, y

# sort values of each tuple from low to high
df['pairs'] = df['pairs'].apply(sort_tuple)

# drop duplicates
df.drop_duplicates(subset=['pairs'], inplace=True)

# print result
df

So what about performance? If you just want to know which rows of your original data are within the desired distance, then the KDTree version (the third one) is super fast: it took just 4 ms to generate the sparse matrix. But since I then used the indices from that matrix to extract the data from the original frame, the performance dropped. Of course, this should be tested on your full dataset.

  • version 1: 93.4 ms
  • version 2: 42.2 ms
  • version 3: 52.3 ms (4 ms)
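One further option if only the pairs are needed: `KDTree.query_pairs` returns the index pairs within a radius directly, skipping the sparse-matrix post-processing and the deduplication step (it already yields each unordered pair once). A sketch on the example data, using the same threshold of 70:

```python
from scipy import spatial

# positions and indices from the example data above
positions = [
    [-2.76788330078, 217.786453247, 26.6822681427],   # ind 21
    [-7.23852539062, 217.274139404, 26.6758270264],   # ind 22
    [-0.868591308594, 2.48404550552, 48.4036369324],  # ind 152
    [-0.304443359375, 182.772140503, 79.4475860596],  # ind 427
]
inds = [21, 22, 152, 427]

tree = spatial.KDTree(positions)

# set of (i, j) row-index pairs with i < j and distance <= 70
pairs = tree.query_pairs(r=70)

# map row indices back to the original 'ind' labels
labeled = sorted((inds[i], inds[j]) for i, j in pairs)
print(labeled)  # [(21, 22), (21, 427), (22, 427)]
```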
Answered by Matthias on Mar 10 '26


