Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

More efficient way to add columns with same string values in multiple dataframes with loops or lambdas?

I want to add a new column, Category, in each of my 8 similar dataframes. The values in this column are the same, they are also the df name, like df1_p8 in this example. I have used:

In:     df61_p8.insert(3,"Category","df61_p8", True)
# or simply, df61_p8['Category']='df61_p8'

Out:
        code    violation_description                                            Category
89491   9-1-503 Defective or obstructed duct system one- building                df61_p8
102045  9-1-503 Defective or obstructed duct system one- building                df61_p8
103369  9-1-503 Defective or obstructed duct system one- building                df61_p8
130440  9-1-502 Failure to maintain at least one (1) elevator                    df61_p8
132446  9-1-503 Defective or obstructed duct system one-  building               df61_p8

Ultimately, I want to append/concat these 8 dataframes into one dataframe.

I wonder if there is more efficient way to do it, rather than using .insert one by one on each dataframe. Something like loops or lambdas.. As a beginner, I am not sure how to apply them in my case? thank you.

append_alldfs = []
x=[df61_p1,df61_p2,df61_p3,df61_p4,df61_p5,df61_p6,df61_p7,df61_p8]
lambdafunc = lambda x: x.insert(3,"Category","x",True)
like image 903
Bluetail Avatar asked Jul 07 '20 16:07

Bluetail


3 Answers

You can use pd.concat with keys parameter then reset_index:

pd.concat([df0,df1,df2,df3], keys=['df0', 'df1', 'df2', 'df3']).reset_index(level=0) 

MCVE:

df0  = pd.DataFrame(np.ones((3,3)), columns=[*'ABC'])
df1  = pd.DataFrame(np.zeros((3,3)), columns=[*'ABC'])
df2  = pd.DataFrame(np.zeros((3,3))+3, columns=[*'ABC'])
df3  = pd.DataFrame(np.zeros((3,3))+4, columns=[*'ABC'])

df_out = pd.concat([df0,df1,df2,df3], keys=['df0', 'df1', 'df2', 'df3']).reset_index(level=0)
df_out

Output:

  level_0    A    B    C
0     df0  1.0  1.0  1.0
1     df0  1.0  1.0  1.0
2     df0  1.0  1.0  1.0
0     df1  0.0  0.0  0.0
1     df1  0.0  0.0  0.0
2     df1  0.0  0.0  0.0
0     df2  3.0  3.0  3.0
1     df2  3.0  3.0  3.0
2     df2  3.0  3.0  3.0
0     df3  4.0  4.0  4.0
1     df3  4.0  4.0  4.0
2     df3  4.0  4.0  4.0
like image 120
Scott Boston Avatar answered Sep 19 '22 00:09

Scott Boston


Keep it simple and explicit.

for col_val, df in [
   ('df61_p1', df61_p1),
   ('df61_p2', df61_p2),
   ('df61_p3', df61_p3),
   ('df61_p4', df61_p4),
   ('df61_p5', df61_p5),
   ('df61_p6', df61_p6),
   ('df61_p7', df61_p7),
   ('df61_p8', df61_p8),
]:
    df['Category'] = col_val

While there are certainly more 'meta-programming-ey' ways of accomplishing the same task, these are usually quite convoluted and more complicated to understand and refactor.

Given the structure of this code, however, I imagine that there are ways you could get rid of this problem before you even get to this point.

For example, at what point did those dataframes get split up? Perhaps by never using separate DataFrames in the first place [keep the original dataframe together/concat at beginning] (and using apply, groupby, pivot and melt operations as needed), you can avoid this problem altogether.

like image 31
Mukul Ram Avatar answered Sep 19 '22 00:09

Mukul Ram


def add_column(df, col_name, col_value):
  return df.insert(loc=-1, column=col_name, value=col_value, allow_duplicates = False)

df_list = [........]
col_name = ... 
col_value = .... # copy column (Category) values

res = map(lambda df: add_column(df, col_name, col_value), df_list)
list(res)
like image 37
4.Pi.n Avatar answered Sep 19 '22 00:09

4.Pi.n