Python Pandas -- merging mostly duplicated rows

Some of my data looks like:

date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,,
1/1/2001,ABC,,,2,
1/1/2001,ABC,,,,35

I am trying to get to the point where I can run

data.set_index(['date', 'name'])

But, with the data as-is, there are of course duplicates (as shown in the above), so I cannot do this (and I don't want an index with duplicates, and I can't simply drop_duplicates(), since this would lose data).

I would like to be able to force rows which have the same [date, name] values into a single row, if they can be successfully converged based on certain values being NaN (similar to the behavior of combine_first()). E.g., the above would end up at

date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,2,35

If two rows have different non-NaN values in the same column, they should not be converged (this would probably be an error that I would need to follow up on).

(To extend the above example, there may in fact be an arbitrary number of lines--given an arbitrary number of columns--which should be able to be converged into one single line.)

This feels like a problem that should be very solvable via pandas, but I am having trouble figuring out an elegant solution.

severian asked Jun 09 '13 04:06



2 Answers

Let's imagine you have some function combine_it that, given a set of rows that would have duplicate values, returns a single row. First, group by date and name:

grouped = data.groupby(['date', 'name'])

Then just apply the aggregation function and boom you're done:

result = grouped.agg(combine_it)
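The function combine_it is left undefined above. As a minimal sketch, assuming the rule described in the question (keep the single non-NaN value of each column within a group, and treat two different non-NaN values as an error), it might look like the following. The body of combine_it and the way the sample data is loaded are illustrative assumptions, not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# Sample data from the question (header spacing normalised).
CSV = """date,name,value1,value2,value3,value4
1/1/2001,ABC,1,1,,
1/1/2001,ABC,,,2,
1/1/2001,ABC,,,,35
"""


def combine_it(column):
    """Collapse one column of one group to a single value (hypothetical helper)."""
    values = column.dropna().unique()
    if len(values) > 1:
        # Two different non-NaN values cannot be converged into one.
        raise ValueError(f"conflicting values: {values.tolist()}")
    # Return the only non-NaN value, or NaN if the whole group is empty.
    return values[0] if len(values) else np.nan


data = pd.read_csv(io.StringIO(CSV))
result = data.groupby(['date', 'name']).agg(combine_it)
print(result)
```

If conflicts do not need to be detected, the built-in data.groupby(['date', 'name']).first() should give the same collapsed row, since it takes the first non-null entry of each column within a group.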

You can also provide different aggregation functions for different columns by passing agg a dict.
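A short sketch of the dict form, assuming column names like those in the question. The sample frame and the choice of 'first' and 'max' are arbitrary and only meant to show the syntax.

```python
import pandas as pd

# Hypothetical data with two rows sharing the same (date, name) key.
data = pd.DataFrame({
    'date': ['1/1/2001', '1/1/2001', '1/2/2001'],
    'name': ['ABC', 'ABC', 'ABC'],
    'value1': [1.0, None, 5.0],
    'value2': [None, 2.0, 7.0],
})

# Each key is a column name; each value is the aggregation for that column.
result = data.groupby(['date', 'name']).agg({'value1': 'first', 'value2': 'max'})
print(result)
```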

Jeff Tratner answered Oct 10 '22 11:10


If you do not have numeric field values, aggregating with count, min, sum, etc. will be neither possible nor sensible. Nevertheless, you may still want to collapse duplicate records into individual records, e.g. based on one or more primary keys.

# First, avoid NaN values in the columns you are grouping on
# (groupby drops rows whose key is NaN by default).
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')

# Define your own customized operation in the pandas agg() function:
# join all values of each group into one comma-separated string.
df = df.groupby(['col1', 'col2']).agg({
    'SEARCH_TERM': lambda x: ', '.join(x),
    'HITS_CONTENT': lambda x: ', '.join(x),
})

Group by one or more columns and collapse each group's values by joining them into a single string. If you prefer, you can instead keep them as a list or tuple stored in each field, or pass agg() a dictionary that applies entirely different operations to different columns.
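To make this concrete, here is a small self-contained sketch. Only the column names come from the snippet above; the sample rows and the names as_strings and as_lists are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; one grouping key is missing on purpose.
df = pd.DataFrame({
    'col1': ['a', 'a', None],
    'col2': ['x', 'x', 'y'],
    'SEARCH_TERM': ['pandas', 'merge', 'rows'],
    'HITS_CONTENT': ['doc1', 'doc2', 'doc3'],
})

# Replace missing keys so that no row is dropped by groupby.
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')

# Variant 1: join each group's values into one string per cell.
as_strings = df.groupby(['col1', 'col2']).agg({
    'SEARCH_TERM': lambda x: ', '.join(x),
    'HITS_CONTENT': lambda x: ', '.join(x),
})

# Variant 2: keep each group's values as a Python list per cell.
as_lists = df.groupby(['col1', 'col2']).agg({
    'SEARCH_TERM': list,
    'HITS_CONTENT': list,
})

print(as_strings)
print(as_lists)
```

Note that the string join assumes every value is already a string; numeric or missing values would need converting first, for example with x.astype(str).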

Philipp Schwarz answered Oct 10 '22 12:10