Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Combining rows in pandas [duplicate]

Tags:

python

pandas

I have a DataFrame with an index called city_id of cities in the format [city],[state] (e.g., new york,ny containing integer counts in the columns. The problem is that I have multiple rows for the same city, and I want to collapse the rows sharing a city_id by adding their column values. I looked at groupby() but it wasn't immediately obvious how to apply it to this problem.

Edit:

An example: I'd like to change this:

city_id    val1 val2 val3 houston,tx    1    2    0 houston,tx    0    0    1 houston,tx    2    1    1 

into this:

city_id    val1 val2 val3 houston,tx    3    3    2 

if there are ~10-20k rows.

like image 773
lightlike Avatar asked Jul 03 '13 02:07

lightlike


People also ask

How do I merge rows in pandas?

To merge rows within a group together in Pandas we can use the agg(~) method together with the join(~) method to concatenate the row values.

How avoid duplicates in pandas merge?

merge() function to join the two data frames by inner join. Now, add a suffix called 'remove' for newly joined columns that have the same name in both data frames. Use the drop() function to remove the columns with the suffix 'remove'. This will ensure that identical columns don't exist in the new dataframe.


2 Answers

Starting from

>>> df               val1  val2  val3 city_id                        houston,tx       1     2     0 houston,tx       0     0     1 houston,tx       2     1     1 somewhere,ew     4     3     7 

I might do

>>> df.groupby(df.index).sum()               val1  val2  val3 city_id                        houston,tx       3     3     2 somewhere,ew     4     3     7 

or

>>> df.reset_index().groupby("city_id").sum()               val1  val2  val3 city_id                        houston,tx       3     3     2 somewhere,ew     4     3     7 

The first approach passes the index values (in this case, the city_id values) to groupby and tells it to use those as the group keys, and the second resets the index and then selects the city_id column. See this section of the docs for more examples. Note that there are lots of other methods in the DataFrameGroupBy objects, too:

>>> df.groupby(df.index) <pandas.core.groupby.DataFrameGroupBy object at 0x1045a1790> >>> df.groupby(df.index).max()               val1  val2  val3 city_id                        houston,tx       2     2     1 somewhere,ew     4     3     7 >>> df.groupby(df.index).mean()               val1  val2      val3 city_id                            houston,tx       1     1  0.666667 somewhere,ew     4     3  7.000000 
like image 162
DSM Avatar answered Oct 02 '22 11:10

DSM


Something in the same line. Sorry not the exact replica.

mydata = [{'subid' : 'B14-111', 'age': 75, 'fdg':1.78},           {'subid' : 'B14-112', 'age': 22, 'fdg':1.56},{'subid' : 'B14-112', 'age': 40, 'fdg':2.00},] df = pandas.DataFrame(mydata)  gg = df.groupby("subid",sort=True).sum() 
like image 43
LonelySoul Avatar answered Oct 02 '22 10:10

LonelySoul