Efficiently selecting rows from pandas dataframe using sorted column

I have a large-ish pandas DataFrame with multiple columns (c1 ... c8) and ~32 million rows. The DataFrame is already sorted by c1. I want to grab the values of other columns from rows that share a particular value of c1.

Something like:

import numpy as np

keys = big_df['c1'].unique()
red = np.zeros(len(keys))
for i, key in enumerate(keys):
    inds = (big_df['c1'] == key)          # boolean mask: scans all ~32M rows for every key
    v1 = big_df.loc[inds, 'c2'].to_numpy()
    v2 = big_df.loc[inds, 'c6'].to_numpy()
    red[i] = reduce_fun(v1, v2)

However, this turns out to be very slow, I think because it checks the entire column against the matching criterion (even though only 10 or so rows out of 32 million might be relevant). Since big_df is sorted by c1 and keys is just the list of all unique values of c1, is there a fast way to build the red[] array? That is, I know the first row for the next key is the row right after the last row of the previous key, and the last row for a key is the last row that matches it, since all subsequent rows are guaranteed not to match.

Thanks,

Ilya

Edit: I am not sure what order the unique() method produces, but basically I want, for every key in keys, the value of reduce_fun(); I don't particularly care what order they are in (presumably the easiest is the order c1 is already sorted in).

Edit2: I slightly restructured the code. Basically, is there an efficient way of constructing inds? According to line_profiler, big_df['c1'] == key takes 75.8% of the total time on my data, while creating v1 and v2 takes 21.6%. (See the sketch below for the kind of thing I had in mind.)
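For concreteness, here is a rough sketch of what I was hoping is possible, using np.searchsorted on the already-sorted c1 column (untested on the real data; reduce_fun is the same placeholder as above):

import numpy as np

# Rough sketch: because big_df is sorted by c1, each key's rows form one
# contiguous block, so searchsorted can find the block boundaries instead of
# comparing the whole column against every key.
c1 = big_df['c1'].to_numpy()
c2 = big_df['c2'].to_numpy()
c6 = big_df['c6'].to_numpy()

keys = np.unique(c1)                              # c1 is sorted, so this keeps its order
starts = np.searchsorted(c1, keys, side='left')   # first row of each key's block
stops = np.searchsorted(c1, keys, side='right')   # one past the last row of each block

red = np.zeros(len(keys))
for i, (lo, hi) in enumerate(zip(starts, stops)):
    red[i] = reduce_fun(c2[lo:hi], c6[lo:hi])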

1 Answer

Rather than a list, I chose a dictionary to hold the reduced values, keyed on each unique value of c1 and built with groupby:

red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
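
For example, with a toy frame and a placeholder reduce_fun (both made up here for illustration, not part of the original question or answer), the comprehension yields one reduced value per unique c1:

import numpy as np
import pandas as pd

def reduce_fun(v1, v2):          # placeholder reduction for the example: a dot product
    return np.dot(v1, v2)

big_df = pd.DataFrame({
    'c1': [1, 1, 2, 2, 2, 3],
    'c2': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'c6': [0.5, 0.5, 1.0, 1.0, 1.0, 2.0],
})

red = {key: reduce_fun(frame['c2'].values, frame['c6'].values)
       for key, frame in big_df.groupby('c1')}
print(red)                       # {1: 1.5, 2: 12.0, 3: 12.0}

groupby visits each group's rows once, so there is no per-key scan over the full column.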

