
How to group near-duplicate values in a pandas dataframe?

If a DataFrame contains duplicate values, pandas already provides functions to replace or drop them. Many experimental datasets, on the other hand, contain 'near' duplicates.

How can one replace these near-duplicate values with, e.g., their mean?

The example data looks as follows:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
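
As a quick check of the premise (a sketch added for illustration, not part of the original question): the built-in exact-duplicate tools leave this data untouched, because no x value repeats exactly.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

# duplicated() flags only rows whose x equals an earlier x exactly.
exact_dupes = df['x'].duplicated()

# drop_duplicates() therefore removes nothing here: 2 and 2.01 are distinct values.
deduped = df.drop_duplicates(subset='x')

print(exact_dupes.any())
print(len(deduped) == len(df))
```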

I hacked together something that bins near duplicates, but it uses a for loop and feels like it works against pandas:

def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    # Note: adds a 'cluster_group' column to the DataFrame passed in.
    used_x = []  # values that already belong to a group
    group_index = 0
    for search_value in df[colname_to_cluster]:

        if search_value in used_x:
            # value is already in a group, skip to next
            continue

        # index of all rows within bin_size of the current value
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1

    return df.groupby('cluster_group').mean()

This does the grouping and averaging:

print(cluster_near_values(df, 'x', 0.1))

                  x     y
cluster_group                
0.0            1.000000  1.00
1.0            2.005000  2.10
2.0            3.000000  3.00
3.0            4.016667  4.17
4.0            5.000000  5.50

Is there a better way to achieve this?

Alexander asked Apr 25 '18 09:04



1 Answer

Here's an example that groups items to one digit of precision. You can modify it as needed, including for binning values with a threshold greater than 1.

import numpy as np

df.groupby(np.ceil(df['x'] * 10) // 10).mean()
            x     y
x                  
1.0  1.000000  1.00
2.0  2.005000  2.10
3.0  3.000000  3.00
4.0  4.016667  4.17
5.0  5.000000  5.50
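
If fixed bin edges are a concern (two close values can fall on opposite sides of an edge), one possible alternative is to sort the column and start a new group wherever the step between neighbours exceeds a tolerance. This is a minimal sketch of that idea, not the method from the answer above; the function name `group_by_gaps` and the `max_gap` parameter are hypothetical, and `max_gap=0.15` is chosen here only to stay clear of floating-point effects around 0.1.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

def group_by_gaps(frame, col, max_gap=0.1):
    """Average rows whose sorted `col` values are linked by steps <= max_gap."""
    ordered = frame.sort_values(col)
    # A new cluster starts wherever the step from the previous value exceeds max_gap.
    # The first diff is NaN, which compares as False, so the first row gets id 0.
    cluster_id = (ordered[col].diff() > max_gap).cumsum()
    return ordered.groupby(cluster_id.rename('cluster_group')).mean()

result = group_by_gaps(df, 'x', max_gap=0.15)
print(result)
```

On the example data this yields the same five groups as the loop in the question. Note that the approach chains neighbours: a long run of closely spaced values ends up in a single group even if its first and last members are far apart, which may or may not be what you want.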
cs95 answered Oct 13 '22 18:10