Speeding up calculation of nearby groups?

I have a data frame that contains a group ID, two distance measures (longitude/latitude-type measures), and a value. For a given set of distances, I want to find the number of other groups nearby and the average value of those other groups.

I've written the following code, but it is so inefficient that it does not complete in a reasonable time for very large data sets. The count of nearby groups is quick, but the calculation of the average value of nearby groups is extremely slow. Is there a better way to make this more efficient?

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

distances = [1, 2]

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])

# get one row per group, with the two distances for each row
df_groups = df.groupby('Group')[['Dist1','Dist2']].mean()

# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])

# find points within a given radius
for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)

    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]

    # get average values of nearby groups
    for idx, val in enumerate(df_groups.index):
        val_idx = df_groups.iloc[closeby[idx]].index.values
        mean = df.loc[df['Group'].isin(val_idx), 'Value'].mean()
        df_groups.loc[val, str(i) + '_mean_values'] = mean

    # merge back to dataframe
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles', 
                                 str(i) + '_mean_values']], 
                  left_on='Group', 
                  right_index=True)
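For reference, `cKDTree.query_ball_tree` returns one Python list per point in the calling tree, holding the indices of all points in the other tree within radius `r` (when a tree is queried against itself, as above, each point finds itself). A minimal illustration with made-up points:

```python
import numpy as np
from scipy.spatial import cKDTree

# three points on a line at x = 0, 1 and 10
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
tree = cKDTree(pts)

# for each point: indices of all points within radius 2 (itself included)
closeby = tree.query_ball_tree(tree, r=2)
print([sorted(ix) for ix in closeby])  # [[0, 1], [0, 1], [2]]
```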
asked Jul 28 '17 by user1566200
1 Answer

It's clear that the problem is indexing the main data frame with the `isin` method: as the data frame grows in length, a much larger search has to be done on every lookup. I propose you do that same search on the smaller df_groups data frame and calculate an updated (pooled) average instead.

import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

df = pd.DataFrame(np.random.randint(0, 100, size=(100000, 4)),
                  columns=['Group', 'Dist1', 'Dist2', 'Value'])
distances = [1,2]
# get the mean of the distances and the mean and count of Value for each group
df_groups = df.groupby('Group')[['Dist1', 'Dist2', 'Value']].agg(
    {'Dist1': 'mean', 'Dist2': 'mean', 'Value': ['mean', 'count']})

# flatten the MultiIndex columns
df_groups.columns = [' '.join(col).strip() for col in df_groups.columns.values]

# rename columns
df_groups.rename(columns={'Dist1 mean': 'Dist1', 'Dist2 mean': 'Dist2',
                          'Value mean': 'Value', 'Value count': 'Count'},
                 inplace=True)


# create KDTree for quick searching
tree = cKDTree(df_groups[['Dist1','Dist2']])

for i in distances:
    closeby = tree.query_ball_tree(tree, r=i)
    # put into density column
    df_groups['groups_within_' + str(i) + 'miles'] = [len(x) for x in closeby]
    #create column to look for subsets
    df_groups['subs'] = [df_groups.index.values[idx] for idx in closeby]
    #set this column to prep updated mean calculation
    df_groups['ComMean'] = df_groups['Value'] * df_groups['Count']

    #perform updated mean
    df_groups[str(i) + '_mean_values'] = [(df_groups.loc[df_groups.index.isin(row), 'ComMean'].sum() /
                                          df_groups.loc[df_groups.index.isin(row), 'Count'].sum()) for row in df_groups['subs']]
    df = pd.merge(df, df_groups[['groups_within_' + str(i) + 'miles',
                                 str(i) + '_mean_values']],
                  left_on='Group',
                  right_index=True)

The formula for an updated (pooled) mean is just (m1*n1 + m2*n2) / (n1 + n2).
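A quick sanity check of that identity (the arrays and names here are just illustrative):

```python
import numpy as np

# two groups of values
a = np.array([1.0, 2.0, 3.0])   # m1 = 2.0,  n1 = 3
b = np.array([10.0, 20.0])      # m2 = 15.0, n2 = 2

m1, n1 = a.mean(), len(a)
m2, n2 = b.mean(), len(b)

# pooled mean computed from the per-group means and counts
pooled = (m1 * n1 + m2 * n2) / (n1 + n2)

# matches the mean of all values taken together
print(pooled, np.concatenate([a, b]).mean())  # 7.2 7.2
```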

old setup 

100000 rows
%timeit old(df)
1 loop, best of 3: 694 ms per loop

1000000 rows
%timeit old(df)
1 loop, best of 3: 6.08 s per loop

10000000 rows
%timeit old(df)
1 loop, best of 3: 6min 13s per loop

new setup

100000 rows
%timeit new(df)
10 loops, best of 3: 136 ms per loop

1000000 rows
%timeit new(df)
1 loop, best of 3: 525 ms per loop

10000000 rows
%timeit new(df)
1 loop, best of 3: 4.53 s per loop
answered Sep 22 '22 by DJK