Some of my data looks like:
date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,,
1/1/2001,ABC,,,2,
1/1/2001,ABC,,,,35
I am trying to get to the point where I can run
data.set_index(['date', 'name'])
But with the data as-is, there are of course duplicates (as shown above), so I cannot do this: I don't want an index with duplicates, and I can't simply drop_duplicates(), since that would lose data.
I would like to be able to force rows which have the same [date, name] values into a single row, if they can be successfully converged based on certain values being NaN (similar to the behavior of combine_first()). E.g., the above would end up as
date, name, value1, value2, value3, value4
1/1/2001,ABC,1,1,2,35
If two values are different and one is not NaN, the two rows should not be converged (this would probably be an error that I would need to follow up on).
(To extend the above example, there may in fact be an arbitrary number of lines--given an arbitrary number of columns--which should be able to be converged into one single line.)
This feels like a problem that should be very solvable via pandas, but I am having trouble figuring out an elegant solution.
Let's imagine you have some function combine_it that, given a set of rows that would have duplicate values, returns a single row. First, group by date and name:
grouped = data.groupby(['date', 'name'])
Then just apply the aggregation function and boom you're done:
result = grouped.agg(combine_it)
You can also provide different aggregation functions for different columns by passing agg a dict.
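A minimal sketch of what combine_it could look like (the name and its conflict-checking logic are assumptions here, not a pandas built-in): for each column within a group, keep the single non-NaN value, and raise if two different non-NaN values collide.

```python
import numpy as np
import pandas as pd

def combine_it(series):
    # Hypothetical helper: collapse one column of a group to its
    # single non-NaN value; raise if conflicting values exist.
    non_null = series.dropna().unique()
    if len(non_null) > 1:
        raise ValueError(f"Conflicting values in {series.name}: {non_null}")
    return non_null[0] if len(non_null) else np.nan

# The example data from the question.
data = pd.DataFrame({
    'date':   ['1/1/2001'] * 3,
    'name':   ['ABC'] * 3,
    'value1': [1, np.nan, np.nan],
    'value2': [1, np.nan, np.nan],
    'value3': [np.nan, 2, np.nan],
    'value4': [np.nan, np.nan, 35],
})

result = data.groupby(['date', 'name']).agg(combine_it)
print(result)
```

The three input rows collapse into one row indexed by (date, name), with each value column filled from whichever row held it.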
If your field values are not numeric, aggregating with count, min, sum, etc. will be neither possible nor sensible. Nevertheless, you may still want to collapse duplicate records into individual records based on one or more primary keys.
# First, avoid NaN values in the columns you are grouping on!
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')

# Define your own customized operation in pandas' agg() function
df = df.groupby(['col1', 'col2']).agg({
    'SEARCH_TERM': lambda x: ', '.join(x.tolist()),
    'HITS_CONTENT': lambda x: ', '.join(x.tolist()),
})
Group by one or more columns and collapse the values by converting them to a list and joining them into a string. If you prefer, you can instead keep them stored as a list or tuple in each field, or pass agg() a dictionary to apply very different operations to different columns.
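A runnable sketch of this string-join approach, using hypothetical data and a made-up column name term (the SEARCH_TERM/HITS_CONTENT names above are just that answer's example):

```python
import pandas as pd

# Hypothetical non-numeric data keyed by two columns, with a NaN key.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'x', None],
    'term': ['foo', 'bar', 'baz'],
})

# Fill NaN in the grouping keys so those rows aren't silently dropped,
# then join each group's string values into one comma-separated field.
df[['col1', 'col2']] = df[['col1', 'col2']].fillna('null')
collapsed = df.groupby(['col1', 'col2']).agg({'term': lambda x: ', '.join(x.tolist())})
print(collapsed)
```

Note the fillna() step matters: by default groupby drops rows whose key is NaN, which would lose the 'b' row entirely.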