Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference(s) between merge() and concat() in pandas

People also ask

What is the difference between merge () and concat () in pandas?

Concat function concatenates dataframes along rows or columns. We can think of it as stacking up multiple dataframes. Merge combines dataframes based on values in shared columns. Merge function offers more flexibility compared to concat function because it allows combinations based on a condition.

What is concat in pandas?

The concat() function is used to concatenate pandas objects along a particular axis with optional set logic along the other axes. Syntax: pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

What is difference between concatenate and combine?

The word concatenate is just another way of saying "to combine" or "to join together". The CONCATENATE function allows you to combine text from different cells into one cell. In our example, we can use it to combine the text in column A and column B to create a combined name in a new column.

Is merge or join faster pandas?

As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.


A very high level difference is that merge() is used to combine two (or more) dataframes on the basis of values of common columns (indices can also be used, use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).

join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().

For example:

df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})

df1:
   Key  data1
0   b   0
1   b   1
2   a   2
3   c   3
4   a   4
5   a   5
6   b   6

df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})

df2:
    Key data2
0   a   0
1   b   1
2   d   2

#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is 
# a common column in 2 dataframes

pd.merge(df1, df2)

   Key data1 data2
0   b    0    1
1   b    1    1
2   b    6    1
3   a    2    0
4   a    4    0
5   a    5    0

#Concat
# df2 dataframe is appended at the bottom of df1 

pd.concat([df1, df2])

   Key data1 data2
0   b   0     NaN
1   b   1     NaN
2   a   2     NaN
3   c   3     NaN
4   a   4     NaN
5   a   5     NaN
6   b   6     NaN
0   a   Nan   0
1   b   Nan   1
2   d   Nan   2

At a high level:

  • .concat() simply stacks multiple DataFrame together either vertically, or stitches horizontally after aligning on index
  • .merge() first aligns two DataFrame' selected common column(s) or index, and then pick up the remaining columns from the aligned rows of each DataFrame.

More specifically, .concat():

  • Is a top-level pandas function
  • Combines two or more pandas DataFrame vertically or horizontally
  • Aligns only on the index when combining horizontally
  • Errors when any of the DataFrame contains a duplicate index.
  • Defaults to outer join with the option for inner join

And .merge():

  • Exists both as a top-level pandas function and a DataFrame method (as of pandas 1.0)
  • Combines exactly two DataFrame horizontally
  • Aligns the calling DataFrame's column(s) or index with the other DataFrame's column(s) or index
  • Handles duplicate values on the joining columns or index by performing a cartesian product
  • Defaults to inner join with options for left, outer, and right

Note that when performing pd.merge(left, right), if left has two rows containing the same values from the joining columns or index, each row will combine with right's corresponding row(s) resulting in a cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.

Practically speaking:

  • Consider .concat() first when combining homogeneous DataFrame, while consider .merge() first when combining complementary DataFrame.
  • If need to merge vertically, go with .concat(). If need to merge horizontally via columns, go with .merge(), which by default merge on the columns in common.

Reference: Pandas 1.x Cookbook


pd.concat takes an Iterable as its argument. Hence, it cannot take DataFrames directly as its argument. Also Dimensions of the DataFrame should match along axis while concatenating.

pd.merge can take DataFrames as its argument, and is used to combine two DataFrames with same columns or index, which can't be done with pd.concat since it will show the repeated column in the DataFrame.

Whereas join can be used to join two DataFrames with different indices.


I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().

Nice question. The main difference:

pd.concat works on both axes.

The other difference, is pd.concat has innerdefault and outer joins only, while pd.DataFrame.merge() has left, right, outer, innerdefault joins.

Third notable other difference is: pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.


With pd.concat by default you are able to stack rows of multiple dataframes (axis=0) and when you set the axis=1 then you mimic the pd.DataFrame.merge() function.

Some useful examples of pd.concat:

df2=pd.concat([df]*2, ignore_index=True) #double the rows of a dataframe

df2=pd.concat([df, df.iloc[[0]]]) # add first row to the end

df3=pd.concat([df1,df2], join='inner', ignore_index=True) # concat two df's