Pandas - remove duplicate rows except the one with highest value from another column

I have a large data frame (more than 100 columns and several hundred thousand rows) with a number of rows that contain duplicate data. I am trying to remove the duplicate rows, keeping the one with the largest value in a different column.

Essentially, I am sorting the data into individual bins based on time period, so across periods, one would expect to find a lot of duplication, as most entities exist across all time periods. What can't be allowed, however, is for the same entity to appear more than once in a given time period.

I tried the approach from "python pandas: Remove duplicates by columns A, keeping the row with the highest value in column B" on a subset of the data, with the plan to recombine it with the original dataframe, df.

Example data subset:

       unique_id  period_id       liq
index
19      CAN00CE0     199001  0.017610
1903    USA07WG0     199001  1.726374
12404   USA07WG0     199001  0.090525
13330   USA08DE0     199001  1.397143
14090   USA04U80     199001  2.000716
12404   USA07WG0     199002  0.090525
13330   USA08DE0     199002  1.397143
14090   USA04U80     199002  2.000716

In the example above, I would like to keep the first instance (as liq is higher with 1.72) and discard the second instance (liq is lower, with 0.09). Note that there can be more than two duplicates in a given period_id.
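
For reference, the subset above can be reproduced with something like this (values copied from the table; the index name "index" is assumed from the printout):

import pandas as pd

# Rough reconstruction of the sample subset shown above (for experimentation only).
df = pd.DataFrame(
    {
        "unique_id": ["CAN00CE0", "USA07WG0", "USA07WG0", "USA08DE0",
                      "USA04U80", "USA07WG0", "USA08DE0", "USA04U80"],
        "period_id": [199001, 199001, 199001, 199001,
                      199001, 199002, 199002, 199002],
        "liq": [0.017610, 1.726374, 0.090525, 1.397143,
                2.000716, 0.090525, 1.397143, 2.000716],
    },
    index=pd.Index([19, 1903, 12404, 13330, 14090, 12404, 13330, 14090],
                   name="index"),
)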

I tried this, but it was very slow for me (I stopped it after more than 5 minutes):

def h(x):
    x = x.dropna()  # idxmax fails on NaNs, and I'm happy to throw out rows where liq is NaN
    return x.loc[x.liq.idxmax()]

df.groupby(['unique_id', 'period_id'], group_keys=False).apply(h)

I ultimately did the below, which is more verbose and ugly and simply throws out all but one duplicate, but it is also very slow! Given the speed of other operations of similar complexity, I thought I would ask here for a better solution.

So my request is really to fix the above code so that it is fast; the below is given as guidance. In the vein of the below, perhaps I could also have discarded the duplicates based on the index, rather than the reset_index/set_index approach that I have employed (see the sketch after the function):

def do_remove_duplicates(df):
    sub_df = df[['period_id', 'unique_id']]
    grp = sub_df.groupby(['period_id', 'unique_id'], as_index=False)
    cln = grp.apply(lambda x: x.drop_duplicates(subset='unique_id'))  # this line is the slow bit!
    cln = cln.reset_index()                 # strip the extra index levels that apply added
    del cln['level_0']
    cln.set_index('level_1', inplace=True)  # restore the original index (same as df)
    df_cln = cln.join(df, how='left', rsuffix='_right')  # left-join back to df, discarding the duplicate rows
    return df_cln
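
On the aside about discarding duplicates based on the index: the reset_index/set_index/join round trip can probably be avoided by calling duplicated on the full frame. A sketch (like the function above, it keeps the first occurrence per pair rather than the max-liq one):

def drop_duplicate_rows(df):
    # Boolean mask marking rows whose (period_id, unique_id) pair has already been seen.
    dupes = df.duplicated(subset=['period_id', 'unique_id'], keep='first')
    return df[~dupes]  # the original index is preserved, no join needed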

asked by Carl

1 Answer

How about this:

  • update all your columns with the group-wise max.
  • pick one row per group (say the first).

This should be much faster as it's vectorized.

In [11]: g = df.groupby(["unique_id", "period_id"], as_index=False)

In [12]: g.transform("max")
Out[12]:
            liq
index
19     0.017610
1903   1.726374
12404  1.726374
13330  1.397143
14090  2.000716
12404  0.090525
13330  1.397143
14090  2.000716

In [13]: df.update(g.transform("max"))

In [14]: g.nth(0)
Out[14]:
          unique_id  period_id       liq
index
19         CAN00CE0     199001  0.017610
1903       USA07WG0     199001  1.726374
13330      USA08DE0     199001  1.397143
14090      USA04U80     199001  2.000716
12404      USA07WG0     199002  0.090525
13330      USA08DE0     199002  1.397143
14090      USA04U80     199002  2.000716

Note: I'd like to use groupby first or last here, but I think there's a bug where they throw away your old index (I don't think they should...). nth does the trick, however.
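
Roughly, the difference being described (exact output shape may vary a little between pandas versions):

g.first()  # group keys come back as columns with a fresh 0..n-1 index; the original row labels are lost
g.nth(0)   # one row per group, with the original row labels (19, 1903, ...) kept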


An alternative is to first slice out the rows whose liq doesn't equal the group-wise max:

(df[df["liq"] == g["liq"].transform("max")]  # keep only the max-liq rows
 .groupby(["unique_id", "period_id"])
 .nth(0))
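
In the same spirit, a sort-then-dedupe sketch does the whole thing in one chain: sort liq descending so that keep="first" retains the highest-liq row per pair (the trailing sort_index is optional and just restores the original row order):

(df.sort_values("liq", ascending=False)
 .drop_duplicates(subset=["unique_id", "period_id"], keep="first")
 .sort_index())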

answered by Andy Hayden