I have a large data frame (more than 100 columns and several hundred thousand rows) with a number of rows that contain duplicate data. I am trying to remove the duplicate rows, keeping the one with the largest value in a different column.
Essentially, I am sorting the data into individual bins based on time period, so across periods, one would expect to find a lot of duplication, as most entities exist across all time periods. What can't be allowed, however, is for the same entity to appear more than once in a given time period.
I tried the approach from "python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B" on a subset of the data, with the plan to recombine the result with the original dataframe, df.
Example data subset:
       unique_id     period_id  liq
index
19     CAN00CE0      199001     0.017610
1903   **USA07WG0**  199001     1.726374
12404  **USA07WG0**  199001     0.090525
13330  USA08DE0      199001     1.397143
14090  USA04U80      199001     2.000716
12404  USA07WG0      199002     0.090525
13330  USA08DE0      199002     1.397143
14090  USA04U80      199002     2.000716
In the example above, I would like to keep the first instance (as liq is higher, at 1.72) and discard the second instance (where liq is lower, at 0.09). Note that there can be more than two duplicates in a given period_id.
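For anyone who wants to reproduce this, the subset above can be rebuilt as follows (a minimal sketch, with the values and index taken straight from the table):

import pandas as pd

df = pd.DataFrame(
    {'unique_id': ['CAN00CE0', 'USA07WG0', 'USA07WG0', 'USA08DE0', 'USA04U80',
                   'USA07WG0', 'USA08DE0', 'USA04U80'],
     'period_id': [199001, 199001, 199001, 199001, 199001,
                   199002, 199002, 199002],
     'liq': [0.017610, 1.726374, 0.090525, 1.397143, 2.000716,
             0.090525, 1.397143, 2.000716]},
    index=pd.Index([19, 1903, 12404, 13330, 14090, 12404, 13330, 14090],
                   name='index'))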
I tried this, but it was very slow for me (I stopped it after more than 5 minutes):
def h(x):
    x = x.dropna()  # idxmax fails on NaNs, and I'm happy to throw out rows where liq is NaN
    return x.loc[x.liq.idxmax()]

df.groupby(['unique_id', 'period_id'], group_keys=False).apply(h)
I ultimately did the below, which is more verbose and ugly, and which simply throws out all but one duplicate, but it is also very slow! Given the speed of other operations of similar complexity, I thought I would ask here for a better solution.
So my request is really to fix the above code so that it is fast. The code below is given as guidance, and if staying in the vein of it, perhaps I could also have discarded the duplicates based on the index, rather than the reset_index/set_index approach that I have employed:
def do_remove_duplicates(df):
    sub_df = df[['period_id', 'unique_id']]
    grp = sub_df.groupby(['period_id', 'unique_id'], as_index=False)
    cln = grp.apply(lambda x: x.drop_duplicates(subset='unique_id'))  # apply drop_duplicates. This line is the slow bit!
    cln = cln.reset_index()                 # remove the extra index levels added by the groupby/apply
    del cln['level_0']                      # drop the group-level index column
    cln.set_index('level_1', inplace=True)  # set the index back to the original (same as df)
    df_cln = cln.join(df, how='left', rsuffix='_right')  # join the cleaned frame with the original, discarding the duplicate rows via the left join
    return df_cln
How about this:
This should be much faster as it's vectorized.
In [11]: g = df.groupby(["unique_id", "period_id"], as_index=False)
In [12]: g.transform("max")
Out[12]:
            liq
index
19     0.017610
1903   1.726374
12404  1.726374
13330  1.397143
14090  2.000716
12404  0.090525
13330  1.397143
14090  2.000716
In [13]: df.update(g.transform("max"))
In [14]: g.nth(0)
Out[14]:
      unique_id  period_id       liq
index
19     CAN00CE0     199001  0.017610
1903   USA07WG0     199001  1.726374
13330  USA08DE0     199001  1.397143
14090  USA04U80     199001  2.000716
12404  USA07WG0     199002  0.090525
13330  USA08DE0     199002  1.397143
14090  USA04U80     199002  2.000716
Note: I'd like to use groupby's first or last here, but I think there's a bug where they throw away your old index (I don't think they should)... nth does the trick, however.
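If you don't actually care about keeping the original index, first works fine here (a sketch; after the update above, every row already carries its group's max liq, so first returns the same values, just keyed by the group labels instead of the old index):

# Sketch: same values as g.nth(0), but indexed by (unique_id, period_id)
# rather than by the original index.
df.groupby(["unique_id", "period_id"]).first()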
An alternative is to first slice out the rows whose liq doesn't equal the group max:
(df[df["liq"] == g["liq"].transform("max")] # keep only max liq rows
.groupby(["unique_id", "period_id"])
.nth(0)
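Spelled out end to end (a sketch; this doesn't rely on the update step or the earlier g, just on df as in the example above):

# Keep only rows whose liq equals their group's max, then take one row per group.
max_liq = df.groupby(["unique_id", "period_id"])["liq"].transform("max")
result = (df[df["liq"] == max_liq]
            .groupby(["unique_id", "period_id"])
            .nth(0))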