If there are duplicate values in a DataFrame, pandas already provides functions to replace or drop them. In many experimental datasets, however, one might have 'near' duplicates.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I tried to hack together something to bin near duplicates, but it uses for loops and feels like working against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
This does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))
                      x     y
cluster_group
0.0            1.000000  1.00
1.0            2.005000  2.10
2.0            3.000000  3.00
3.0            4.016667  4.17
4.0            5.000000  5.50
Is there a better way to achieve this?
You can do this by grouping with the pandas.DataFrame.groupby() function on a binned version of the column of interest and then taking the mean of each group.
Here's an example that groups items to one digit of precision. You can modify this as needed; a sketch for arbitrary bin widths, including thresholds over 1, follows the output below.
import numpy as np

df.groupby(np.ceil(df['x'] * 10) // 10).mean()
            x     y
x
1.0  1.000000  1.00
2.0  2.005000  2.10
3.0  3.000000  3.00
4.0  4.016667  4.17
5.0  5.000000  5.50
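For other bin widths, one possibility is to snap each value to the nearest multiple of a chosen width and group on that. This is only a minimal sketch, not part of the answer above; the bin_size name and the round-to-nearest choice are assumptions, and values sitting exactly on a bin boundary may land in a neighboring bin:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

bin_size = 0.5  # hypothetical width; any positive value works, including > 1

# Snap each x to the nearest multiple of bin_size, then average per bin.
binned = (df['x'] / bin_size).round() * bin_size
print(df.groupby(binned).mean())

With this data and bin_size=0.5, the resulting groups match the output just above; each group is labelled by its bin centre rather than a running index.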