The concat function concatenates DataFrames along rows or columns; think of it as stacking DataFrames on top of or beside one another. The merge function combines DataFrames based on values in shared columns. merge is more flexible than concat because it combines rows based on a condition.
The concat() function concatenates pandas objects along a particular axis, with optional set logic along the other axes. Syntax, as of older pandas releases: pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True). The join_axes parameter was removed in pandas 1.0, so omit it on current versions.
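To make the axis and join parameters concrete, here is a minimal sketch. The frames a and b and their column names are made up for illustration; only pd.concat and its documented axis, join, and ignore_index parameters are used.

```python
import pandas as pd

# Two small illustrative frames that share only the column "y"
a = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
b = pd.DataFrame({"y": [5, 6], "z": [7, 8]})

# axis=0 stacks rows; join="outer" keeps the union of columns (x, y, z)
rows_outer = pd.concat([a, b], axis=0, join="outer", ignore_index=True)

# join="inner" keeps only the columns present in every input (y)
rows_inner = pd.concat([a, b], axis=0, join="inner", ignore_index=True)

# axis=1 places the frames side by side, aligned on the row index
side_by_side = pd.concat([a, b], axis=1)

print(rows_outer)
print(rows_inner)
print(side_by_side)
```

With the outer join, cells that have no source value are filled with NaN; the inner join avoids NaN by dropping the non-shared columns.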
The word "concatenate" is just another way of saying "to combine" or "to join together". In spreadsheets, for instance, the CONCATENATE function combines text from different cells into one cell, such as the text in column A and column B to create a combined name in a new column.
As you can see, merge is faster than join. The difference per call is small, but over 4000 iterations it adds up to minutes.
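A rough way to check a timing claim like this on your own data is the standard timeit module. The frames below and the iteration count n are hypothetical, so treat this as a sketch of the measurement, not a reproduction of the numbers above; results will vary with pandas version and data shape.

```python
import timeit
import pandas as pd

# Hypothetical inputs: two frames sharing the default integer index
left = pd.DataFrame({"a": range(1000)})
right = pd.DataFrame({"b": range(1000)})

n = 200  # assumed iteration count, kept small so the sketch runs quickly

# Time an index-based merge and the equivalent join
merge_time = timeit.timeit(
    lambda: pd.merge(left, right, left_index=True, right_index=True), number=n
)
join_time = timeit.timeit(lambda: left.join(right), number=n)

print(f"merge: {merge_time:.3f}s  join: {join_time:.3f}s over {n} calls")
```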
A very high-level difference is that merge() is used to combine two dataframes (chain calls for more) on the basis of values of common columns (indices can also be used; use left_index=True and/or right_index=True), and concat() is used to append one (or more) dataframes one below the other (or sideways, depending on whether the axis option is set to 0 or 1).
join() is used to merge 2 dataframes on the basis of the index; instead of using merge() with the option left_index=True we can use join().
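The relationship between join() and an index-based merge() can be sketched as follows. The frames and index labels are invented for illustration. Note one assumption made explicit here: join() defaults to a left join while merge() defaults to an inner join, so how="left" is passed to merge() to make the two comparable.

```python
import pandas as pd

# Illustrative frames indexed by string labels
left = pd.DataFrame({"a": [1, 2, 3]}, index=["k1", "k2", "k3"])
right = pd.DataFrame({"b": [10, 20]}, index=["k1", "k3"])

# merge on the index of both frames; how="left" matches join()'s default
via_merge = pd.merge(left, right, left_index=True, right_index=True, how="left")

# join() aligns on the index and defaults to a left join
via_join = left.join(right)

print(via_join)
print(via_merge.equals(via_join))
```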
For example:
df1 = pd.DataFrame({'Key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df1:
Key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2 = pd.DataFrame({'Key': ['a', 'b', 'd'], 'data2': range(3)})
df2:
Key data2
0 a 0
1 b 1
2 d 2
#Merge
# The 2 dataframes are merged on the basis of values in column "Key" as it is
# a common column in 2 dataframes
pd.merge(df1, df2)
Key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0
#Concat
# df2 dataframe is appended at the bottom of df1
pd.concat([df1, df2])
Key data1 data2
0 b 0.0 NaN
1 b 1.0 NaN
2 a 2.0 NaN
3 c 3.0 NaN
4 a 4.0 NaN
5 a 5.0 NaN
6 b 6.0 NaN
0 a NaN 0.0
1 b NaN 1.0
2 d NaN 2.0
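Using the same df1 and df2, the sketch below shows two variations that follow from the example: merge() drops keys that appear in only one frame ("c" and "d") unless how="outer" is requested, and concat() repeats the original row labels unless ignore_index=True is passed. The variable names inner, outer, and stacked are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Key": ["b", "b", "a", "c", "a", "a", "b"], "data1": range(7)})
df2 = pd.DataFrame({"Key": ["a", "b", "d"], "data2": range(3)})

# Default how="inner": only keys present in both frames survive
inner = pd.merge(df1, df2, on="Key")

# how="outer": unmatched keys are kept and padded with NaN
outer = pd.merge(df1, df2, on="Key", how="outer")

# ignore_index=True renumbers the stacked rows instead of repeating 0, 1, 2
stacked = pd.concat([df1, df2], ignore_index=True)

print(outer)
print(stacked)
```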
At a high level:
- .concat() simply stacks multiple DataFrames together, either vertically or, after aligning on the index, horizontally.
- .merge() first aligns two DataFrames on selected common column(s) or the index, and then picks up the remaining columns from the aligned rows of each DataFrame.

More specifically, .concat():
- combines two or more DataFrames vertically or horizontally
- can raise an error when combining horizontally if any DataFrame contains a duplicate index

And .merge():
- exists as both a top-level function and a DataFrame method (as of pandas 1.0)
- combines two DataFrames horizontally
- aligns one DataFrame's column(s) or index with the other DataFrame's column(s) or index

Note that when performing pd.merge(left, right), if left has two rows containing the same values in the joining columns or index, each row combines with the corresponding row(s) of right, resulting in a Cartesian product. On the other hand, if .concat() is used to combine columns, we need to make sure no duplicated index exists in either DataFrame.
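The Cartesian-product behaviour of merge() on duplicated keys can be seen in a small sketch. The frames are invented for illustration. The validate argument shown at the end is a documented merge() option that raises pandas.errors.MergeError when the keys are not unique as declared; using it here is a suggestion, not something the answer above prescribes.

```python
import pandas as pd

# "a" appears twice on the left and three times on the right
left = pd.DataFrame({"key": ["a", "a"], "l": [1, 2]})
right = pd.DataFrame({"key": ["a", "a", "a"], "r": [10, 20, 30]})

# Every left row pairs with every matching right row: 2 x 3 = 6 rows
product = pd.merge(left, right, on="key")
print(len(product))

# validate= guards against accidental row multiplication
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    validated = True
except pd.errors.MergeError:
    validated = False
print(validated)
```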
Practically speaking:
- Consider .concat() first when combining homogeneous DataFrames, and consider .merge() first when combining complementary DataFrames.
- To combine vertically, go with .concat(). To combine horizontally via columns, go with .merge(), which by default merges on the columns in common.

Reference: Pandas 1.x Cookbook
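As a sketch of this rule of thumb, the example below treats "homogeneous" as frames with the same columns and "complementary" as frames with different columns describing the same keys. That reading, and all the data and names, are assumptions made for illustration.

```python
import pandas as pd

# Homogeneous: same columns, different rows -> stack vertically with concat
jan = pd.DataFrame({"id": [1, 2], "sales": [100, 150]})
feb = pd.DataFrame({"id": [1, 3], "sales": [120, 90]})
all_sales = pd.concat([jan, feb], ignore_index=True)

# Complementary: different columns about the same ids -> merge on the shared column
names = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
labelled = pd.merge(all_sales, names)  # no on= given: uses the common column "id"

print(labelled)
```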
pd.concat takes an iterable (for example, a list) of pandas objects as its first argument, so a bare DataFrame cannot be passed in its place. The labels on the other axis do not have to match: with the default join='outer', missing entries are filled with NaN.
pd.merge takes two DataFrames directly as arguments and combines them on shared columns or on the index. pd.concat cannot do this, since concatenating along axis=1 repeats the shared column in the resulting DataFrame.
join, in contrast, joins two DataFrames on their indices, which do not have to be identical.
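A short sketch of these points, with invented frames a and b that share a "key" column. It shows that pd.concat rejects a bare DataFrame, and that a side-by-side concat keeps the shared column twice while merge keeps it once.

```python
import pandas as pd

a = pd.DataFrame({"key": [1, 2], "x": ["p", "q"]})
b = pd.DataFrame({"key": [1, 2], "y": ["r", "s"]})

# A bare DataFrame is rejected: concat expects an iterable of pandas objects
try:
    pd.concat(a)
    bare_ok = True
except TypeError:
    bare_ok = False

# Side-by-side concat keeps "key" from both frames
side = pd.concat([a, b], axis=1)

# merge matches rows on "key" and keeps a single copy of it
merged = pd.merge(a, b, on="key")

print(bare_ok)
print(list(side.columns))
print(list(merged.columns))
```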
I am currently trying to understand the essential difference(s) between pd.DataFrame.merge() and pd.concat().
Nice question. The main difference: pd.concat works on both axes.
The other difference is that pd.concat supports only outer (default) and inner joins, while pd.DataFrame.merge() supports left, right, outer, and inner (default) joins.
A third notable difference: pd.DataFrame.merge() has the option to set the column suffixes when merging columns with the same name, while for pd.concat this is not possible.
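The suffix behaviour can be sketched like this. The frames and the suffix strings are invented for illustration. The keys argument shown for pd.concat is not a suffix option; it is offered here only as one possible way to tell same-named columns apart, by adding an outer column level.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
right = pd.DataFrame({"id": [1, 2], "value": [30, 40]})

# Default suffixes for clashing non-key columns are "_x" and "_y"
default = pd.merge(left, right, on="id")

# suffixes= replaces them with names of your choice
named = pd.merge(left, right, on="id", suffixes=("_old", "_new"))

# concat has no suffixes argument; keys= labels each input with an outer level
labelled = pd.concat([left, right], axis=1, keys=["old", "new"])

print(list(default.columns))
print(list(named.columns))
print(list(labelled.columns))
```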
By default, pd.concat stacks the rows of multiple dataframes (axis=0); when you set axis=1, it mimics an index-based pd.DataFrame.merge().
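To illustrate how axis=1 resembles a merge on the index, here is a minimal sketch with two invented frames whose indices only partly overlap. join="inner" is passed to concat because merge defaults to an inner join while concat defaults to an outer join.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
b = pd.DataFrame({"y": [4, 5, 6]}, index=[1, 2, 3])

# Side-by-side concat, keeping only index labels present in both frames
side_inner = pd.concat([a, b], axis=1, join="inner")

# Index-based merge with its default inner join
merged = pd.merge(a, b, left_index=True, right_index=True)

# Default outer concat keeps every index label and pads with NaN
side_outer = pd.concat([a, b], axis=1)

print(side_inner)
print(merged)
print(side_outer)
```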
Some useful examples of pd.concat:
df2 = pd.concat([df] * 2, ignore_index=True)  # double the rows of a dataframe
df2 = pd.concat([df, df.iloc[[0]]])  # add first row to the end
df3 = pd.concat([df1, df2], join='inner', ignore_index=True)  # concat two dfs, keeping only shared columns