Pandas merge two dataframes with different columns

Tags:

python

pandas

dataframe

data-munging

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.

>df_may
   id  quantity  attr_1  attr_2
0   1        20       0       1
1   2        23       1       1
2   3        19       1       1
3   4        19       0       0

>df_jun
   id  quantity  attr_1  attr_3
0   5         8       1       0
1   6        13       0       1
2   7        20       1       1
3   8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

I've also tried joining on a single column (e.g. on="id"), but that duplicates every column except id (attr_1_x, attr_1_y, and so on), which is not ideal. I've also tried passing the entire list of columns (there are many) to on:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.

Thanks in advance.

347

asked Jan 22 '15 19:01

economy

2 Answers

I think in this case concat is what you want:

In [12]: pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

Passing axis=0 stacks the DataFrames on top of each other, which I believe is what you want. Any column that is missing from one of the DataFrames is filled with NaN for that DataFrame's rows.
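For a self-contained version of the approach above, here is a minimal sketch that rebuilds the two frames from the question and stacks them. It assumes the answer's df and df1 stand for the question's df_may and df_jun. In recent pandas versions the columns of the result may appear in order of first appearance rather than alphabetically as in the output shown above, and columns that receive NaN are upcast to float.

```python
import pandas as pd

# Sample data copied from the question.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# axis=0 stacks the rows; ignore_index=True renumbers the result from 0.
# Columns present in only one frame are filled with NaN for the other's rows.
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

The result has all eight rows and the union of the columns, with attr_2 empty for the June rows and attr_3 empty for the May rows.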

162

answered Sep 21 '22 23:09

EdChum

The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has 3x trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To fix this, deduplicate the column names before concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns)

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner but less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for pandas <1.3.0, use: parser = pd.io.parsers.ParserBase({})
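Since _maybe_dedup_names is a private pandas method (note the leading underscore), it may move or disappear between releases. As a hedged alternative, here is a minimal sketch of a plain-Python helper that produces the same trial, trial.1, trial.2 naming. The name dedup_columns is hypothetical and not part of pandas, and the sketch assumes the generated names do not collide with column names that already exist.

```python
from collections import Counter

import pandas as pd


def dedup_columns(columns):
    """Suffix repeated names with .1, .2, ... (hypothetical helper, not a pandas API)."""
    seen = Counter()
    result = []
    for name in columns:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        result.append(f"{name}.{count}" if count else name)
        seen[name] += 1
    return result


A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

# Rename on copies so the original frames are left untouched.
frames = []
for df in (A, B):
    renamed = df.copy()
    renamed.columns = dedup_columns(df.columns)
    frames.append(renamed)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Working on copies keeps A and B unchanged, which differs from the in-place loop above; assign back to df.columns directly if mutating the originals is acceptable.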

answered Sep 20 '22 23:09

tdy
