
Pandas merge two dataframes with different columns

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.

>df_may
   id  quantity  attr_1  attr_2
0   1        20       0       1
1   2        23       1       1
2   3        19       1       1
3   4        19       0       0

>df_jun
   id  quantity  attr_1  attr_3
0   5         8       1       0
1   6        13       0       1
2   7        20       1       1
3   8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer") 

But that yields:

Left data columns not unique: Index([.... 

I've also tried joining on a single column (e.g. on="id"), but that duplicates every column except id, producing attr_1_x, attr_1_y and so on, which is not ideal. I've also passed the entire list of columns (there are many) to on:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values)) 

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2) 

What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.

Thanks in advance.

economy asked Jan 22 '15 19:01





2 Answers

I think in this case concat is what you want:

In [12]:
pd.concat([df,df1], axis=0, ignore_index=True)

Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

Passing axis=0 stacks the DataFrames on top of each other, which I believe is what you want, and fills in NaN wherever a column is absent from one of them. (Here df and df1 stand for your df_may and df_jun.)
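For reference, here is a self-contained sketch of the same idea using the frame names and sample values from the question. The data is retyped from the printed tables, so treat it as an illustration rather than the asker's real dataset, which has many more columns.

```python
import pandas as pd

# Sample frames rebuilt from the tables shown in the question
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns missing from one frame are filled with NaN
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

The result has all eight rows and the union of the columns. attr_2 is NaN for the June rows and attr_3 is NaN for the May rows.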

EdChum answered Sep 21 '22 23:09



The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has 3x trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects
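If it is unclear whether duplicate headers are the culprit, a quick check along these lines may help. It is only a sketch using the frame A from the example; the variable names has_unique_columns and dup_names are illustrative.

```python
import pandas as pd

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])

# False when at least one column name repeats
has_unique_columns = A.columns.is_unique

# Names that occur more than once, each reported a single time
dup_names = A.columns[A.columns.duplicated()].unique().tolist()

print(has_unique_columns)  # False
print(dup_names)           # ['trial']
```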

To fix this, deduplicate the column names before concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a less readable one-liner:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True) 

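Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({}). Also keep in mind that _maybe_dedup_names is a private pandas method (note the leading underscore), so it may change or disappear in other versions.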
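If relying on pandas internals is a concern, the renaming can also be sketched in plain Python. The helper dedup_columns below is hypothetical (it is not part of pandas); it imitates the name, name.1, name.2 pattern shown above and assumes none of the generated names already exist as columns.

```python
from collections import Counter

import pandas as pd


def dedup_columns(columns):
    """Rename repeated labels to name, name.1, name.2, ... (hypothetical helper).

    Assumes a generated label such as 'trial.1' is not already a column name.
    """
    seen = Counter()
    renamed = []
    for name in columns:
        count = seen[name]
        # First occurrence keeps its name; later ones get a numeric suffix
        renamed.append(f"{name}.{count}" if count else name)
        seen[name] += 1
    return renamed


A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])

# set_axis returns new frames, so A and B themselves are left unchanged
frames = [df.set_axis(dedup_columns(df.columns), axis=1) for df in (A, B)]
result = pd.concat(frames, ignore_index=True)
print(result)
```

This avoids the private parser class, at the cost of maintaining the small helper yourself.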

tdy answered Sep 20 '22 23:09
