If there are duplicate values in a DataFrame, pandas already provides functions to replace or drop them. In many experimental datasets, however, one might have 'near' duplicates.
How can one replace these near-duplicate values with, e.g., their mean?
The example data looks as follows:
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})
I tried to hack together something to bin near duplicates, but it uses for loops and feels like working against pandas:
def cluster_near_values(df, colname_to_cluster, bin_size=0.1):
    used_x = []  # list of values already grouped
    group_index = 0
    for search_value in df[colname_to_cluster]:
        if search_value in used_x:
            # value is already in a group, skip to next
            continue
        g_ix = df[abs(df[colname_to_cluster] - search_value) < bin_size].index
        used_x.extend(df.loc[g_ix, colname_to_cluster])
        df.loc[g_ix, 'cluster_group'] = group_index
        group_index += 1
    return df.groupby('cluster_group').mean()
This does the grouping and averaging:
print(cluster_near_values(df, 'x', 0.1))
                      x     y
cluster_group
0.0            1.000000  1.00
1.0            2.005000  2.10
2.0            3.000000  3.00
3.0            4.016667  4.17
4.0            5.000000  5.50
Is there a better way to achieve this?
You can do this by grouping with the pandas.DataFrame.groupby() function on a binned version of the column of interest and then taking the mean of each group.
Here's an example that groups items to one digit of precision. You can modify this as needed; a sketch for arbitrary bin widths, including thresholds over 1, follows the output below.
import numpy as np

df.groupby(np.ceil(df['x'] * 10) // 10).mean()
            x     y
x
1.0  1.000000  1.00
2.0  2.005000  2.10
3.0  3.000000  3.00
4.0  4.016667  4.17
5.0  5.000000  5.50
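For other bin widths, one possibility is to snap each value to the nearest multiple of a chosen width and group on that. This is only a minimal sketch, not part of the answer above; the bin_size name and the round-to-nearest choice are assumptions, and values sitting exactly on a bin boundary may land in a neighboring bin:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 2.01, 3, 4, 4.1, 3.95, 5],
                   'y': [1, 2, 2.2, 3, 4.1, 4.4, 4.01, 5.5]})

bin_size = 0.5  # hypothetical width; any positive value works, including > 1

# Snap each x to the nearest multiple of bin_size, then average per bin.
binned = (df['x'] / bin_size).round() * bin_size
print(df.groupby(binned).mean())

With this data and bin_size=0.5, the resulting groups match the output just above; each group is labelled by its bin centre rather than a running index.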