Some of my data looks like:
date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,,
1/1/2001,ABC,,,2,
1/1/2001,ABC,,,,35
I am trying to get to the point where I can run
data.set_index(['date', 'name'])
But with the data as-is, there are of course duplicates (as shown above), so I cannot do this: I don't want an index with duplicates, and I can't simply drop_duplicates(), since that would lose data.
I would like to be able to force rows which have the same [date, name] values into a single row, if they can be successfully converged based on certain values being NaN (similar to the behavior of combine_first()). E.g., the above would end up as
date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,2,35
If two values are different and one is not NaN, the two rows should not be converged (this would probably be an error that I would need to follow up on).
(To extend the above example, there may in fact be an arbitrary number of lines--given an arbitrary number of columns--which should be able to be converged into one single line.)
This feels like a problem that should be very solvable via pandas, but I am having trouble figuring out an elegant solution.
Let's imagine you have some function combine_it that, given a set of rows that would have duplicate values, returns a single row. First, group by date and name:
grouped = data.groupby(['date', 'name'])
Then just apply the aggregation function and boom you're done:
result = grouped.agg(combine_it)
You can also provide different aggregation functions for different columns by passing agg a dict.
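A minimal sketch of what combine_it could look like (the name and its conflict-checking logic are assumptions here, not a pandas built-in): for each column within a group, keep the single non-NaN value, and raise if two different non-NaN values collide.

```python
import numpy as np
import pandas as pd

def combine_it(series):
    # Hypothetical helper: collapse one column of a group to its
    # single non-NaN value; raise if conflicting values exist.
    non_null = series.dropna().unique()
    if len(non_null) > 1:
        raise ValueError(f"Conflicting values in {series.name}: {non_null}")
    return non_null[0] if len(non_null) else np.nan

# The example data from the question.
data = pd.DataFrame({
    'date':   ['1/1/2001'] * 3,
    'name':   ['ABC'] * 3,
    'value1': [1, np.nan, np.nan],
    'value2': [1, np.nan, np.nan],
    'value3': [np.nan, 2, np.nan],
    'value4': [np.nan, np.nan, 35],
})

result = data.groupby(['date', 'name']).agg(combine_it)
print(result)
```

The three input rows collapse into one row indexed by (date, name), with each value column filled from whichever row held it.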
If your field values are not numeric, aggregating with count, min, sum, etc. will be neither possible nor sensible. Nevertheless, you may still want to collapse duplicate records into individual records based on one or more primary keys.
# First, avoid NaN values in the columns you are grouping on!
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')

# Define your own customized operation in pandas' agg() function
df = df.groupby(['col1', 'col2']).agg({
    'SEARCH_TERM': lambda x: ', '.join(x.tolist()),
    'HITS_CONTENT': lambda x: ', '.join(x.tolist()),
})
Group by one or more columns and collapse the values by converting them to a list and joining them into a string. If you prefer, you can instead keep them stored as a list or tuple in each field, or pass agg() a dictionary to apply very different operations to different columns.
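A runnable sketch of this string-join approach, using hypothetical data and a made-up column name term (the SEARCH_TERM/HITS_CONTENT names above are just that answer's example):

```python
import pandas as pd

# Hypothetical non-numeric data keyed by two columns, with a NaN key.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'x', None],
    'term': ['foo', 'bar', 'baz'],
})

# Fill NaN in the grouping keys so those rows aren't silently dropped,
# then join each group's string values into one comma-separated field.
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')
collapsed = df.groupby(['col1', 'col2']).agg({'term': lambda x: ', '.join(x.tolist())})
print(collapsed)
```

Note the fillna() step matters: by default groupby drops rows whose key is NaN, which would lose the 'b' row entirely.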