How to aggregate, combining dataframes, with pandas groupby

I have a dataframe df and a column df['table'] such that each item in df['table'] is another dataframe with the same headers/number of columns. I was wondering if there's a way to do a groupby like this:

Original dataframe:

name    table
Bob     Pandas df1
Joe     Pandas df2
Bob     Pandas df3
Bob     Pandas df4
Emily   Pandas df5

After groupby:

name    table
Bob     Pandas df containing the appended df1, df3, and df4
Joe     Pandas df2
Emily   Pandas df5

I found this code snippet to do a groupby and lambda for strings in a dataframe, but haven't been able to figure out how to append entire dataframes in a groupby.

df['table'] = df.groupby(['name'])['table'].transform(lambda x : ' '.join(x))

I've also tried df['table'] = df.groupby(['name'])['HTML'].apply(list), but that gives me a df['table'] of all NaN.

Thanks for your help!!

asked Oct 07 '20 18:10

Anonymous

2 Answers

Given three dataframes:

import pandas as pd

dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})

And given another dataframe whose 'table' column holds those dataframes:

df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'], 'table': [dfa, dfa, dfb, dfc, dfb]})

# print the type for the first value in the table column, to confirm it's a dataframe
print(type(df.loc[0, 'table']))
[out]:
<class 'pandas.core.frame.DataFrame'>

Each group of dataframes can be combined into a single dataframe by grouping with .groupby and, for each group, collecting the dataframes into a list and concatenating them with pd.concat:

# aggregate every non-key column; use this when all of them hold dataframes
dfg = df.groupby('name').agg(lambda x: pd.concat(list(x)).reset_index(drop=True))

# display(dfg.loc['Bob', 'table'])
       a
0      1
1      2
2      3
3      a
4      b
5      c
6    pie
7  steak
8   milk

# to aggregate only one or more selected columns when the dataframe has other columns
dfg = df.groupby('name')[['table']].agg(lambda x: pd.concat(list(x)).reset_index(drop=True))
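
If you would rather not depend on how agg handles a lambda that returns a dataframe, the same grouping can be sketched as a plain loop over the groups. This is only a minimal sketch and not part of the answer above: the names combined and result are made up for illustration, and sort=False is my own choice so that names stay in order of first appearance.

```python
import pandas as pd

dfa = pd.DataFrame({'a': [1, 2, 3]})
dfb = pd.DataFrame({'a': ['a', 'b', 'c']})
dfc = pd.DataFrame({'a': ['pie', 'steak', 'milk']})

df = pd.DataFrame({'name': ['Bob', 'Joe', 'Bob', 'Bob', 'Emily'],
                   'table': [dfa, dfa, dfb, dfc, dfb]})

# iterate over the groups and concatenate the dataframes of each name;
# ignore_index=True renumbers the rows of each combined dataframe from 0
combined = {
    name: pd.concat(group['table'].tolist(), ignore_index=True)
    for name, group in df.groupby('name', sort=False)
}

print(list(combined))  # ['Bob', 'Joe', 'Emily']

# optionally rebuild a two-column dataframe like the one in the question
result = pd.DataFrame({'name': list(combined),
                       'table': list(combined.values())})
```

The dictionary is often enough on its own, since combined['Bob'] gives the merged table directly.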

Not a duplicate

Originally, I had marked this question as a duplicate of How to group dataframe rows into list in pandas groupby, thinking the dataframes could be aggregated into a list, and then combined with pd.concat.

df.groupby('name')['table'].apply(list)
df.groupby('name').agg(list)
df.groupby('name')['table'].agg(list)
df.groupby('name').agg({'table': list})
df.groupby('name').agg(lambda x: list(x))

However, these all result in a StopIteration error when the values being aggregated are dataframes.

answered Oct 16 '22 19:10

Trenton McKinney

Let's create a dataframe whose table column holds dataframes.

First, I start with three dataframes:

import pandas as pd

# create the dataframes that we will assign to Bob and Joe; note the b and j prefixes in 'letter'

df1 = pd.DataFrame({'var1':[12, 34, -4, None], 'letter':['b1', 'b2', 'b3', 'b4']})
df2 = pd.DataFrame({'var1':[1, 23, 44, 0], 'letter':['j1', 'j2', 'j3', 'j4']})
df3 = pd.DataFrame({'var1':[22, -3, 7, 78], 'letter':['b5', 'b6', 'b7', 'b8']})

# make a list of dictionaries, one per row
list_of_dfs = [
    {'name':'Bob' ,'table':df1},
    {'name':'Joe' ,'table':df2},
    {'name':'Bob' ,'table':df3}
]

# construct the main dataframe
original_df = pd.DataFrame(list_of_dfs)
print(original_df)

original_df.shape #shows (3, 2)

Now that the original dataframe exists as the input, we can produce the new dataframe using groupby(), agg(), and pd.concat(). The final reset_index() turns the name index back into a regular column.

new_df = original_df.groupby('name')['table'].agg(lambda series: pd.concat(series.tolist())).reset_index()
print(new_df)

# check that Bob's table is now the concatenation of df1 and df3
new_df.loc[new_df['name'] == 'Bob', 'table'].iloc[0]

The output of the last line of code is:

    var1    letter
0   12.0    b1
1   34.0    b2
2   -4.0    b3
3    NaN    b4
0   22.0    b5
1   -3.0    b6
2    7.0    b7
3   78.0    b8
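
Note that the concatenated table above keeps each source dataframe's own row labels (0 to 3, twice). If unique row labels are wanted, pd.concat accepts ignore_index=True. A small sketch of the difference, reusing df1 and df3 from this answer; the names kept and renumbered are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'var1': [12, 34, -4, None], 'letter': ['b1', 'b2', 'b3', 'b4']})
df3 = pd.DataFrame({'var1': [22, -3, 7, 78], 'letter': ['b5', 'b6', 'b7', 'b8']})

# default: each dataframe keeps its original row labels
kept = pd.concat([df1, df3])
print(kept.index.tolist())        # [0, 1, 2, 3, 0, 1, 2, 3]

# ignore_index=True: rows are renumbered 0..n-1
renumbered = pd.concat([df1, df3], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
```

With the duplicated labels, kept.loc[0] returns two rows (b1 and b5), which is easy to overlook.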

answered Oct 16 '22 20:10

Muhammet Coskun

Tags: python, pandas, dataframe, lambda, pandas-groupby