None of these questions adress the issue: Question 1 and Question 2 nor could I find the answer in pandas documentation.
Hello, I am trying to find the underlying cause for this error:
ValueError: You are trying to merge on object and int64 columns.
I know I can work around this problem using pandas concat
or merge
function, but I am trying to understand the cause for the error. The question is: Why do I get this ValueError
?
Here's the output of the head(5)
and info()
on both dataframes that are used.
print(the_big_df.head(5))
Output:
account apt apt_p balance date day flag month reps reqid year
0 AA0420 0 0.0 -578.30 2019-03-01 1 1 3 10 82f2d761 2019
1 AA0420 0 0.1 -578.30 2019-03-02 2 1 3 10 82f2d761 2019
2 AA0420 0 0.1 -578.30 2019-03-03 3 1 3 10 82f2d761 2019
3 AA0421 0 0.1 -607.30 2019-03-04 4 1 3 10 82f2d761 2019
4 AA0421 0 0.1 -610.21 2019-03-05 5 1 3 10 82f2d761 2019
print(the_big_df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36054 entries, 0 to 36053
Data columns (total 11 columns):
account 36054 non-null object
apt 36054 non-null int64
apt_p 36054 non-null float64
balance 36054 non-null float64
date 36054 non-null datetime64[ns]
day 36054 non-null int64
flag 36054 non-null int64
month 36054 non-null int64
reps 36054 non-null int32
reqid 36054 non-null object
year 36054 non-null int64
dtypes: datetime64[ns](1), float64(2), int32(1), int64(5), object(2)
memory usage: 3.2+ MB
Here's the dataframe I'm passing to the join()
; print(df_to_join.head(5))
:
reqid id
0 54580f39 13301
1 3ba905c0 77114
2 5f2d80da 13302
3 a1478e98 77115
4 9b09854b 78598
print(df_to_join.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14332 entries, 0 to 14331
Data columns (total 2 columns):
reqid 14332 non-null object
dni 14332 non-null object
The exact next line after the 4 prints stated above is:
the_max_df = the_big_df.join(df_to_join,on='reqid')
And the output is, as stated above:
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
Why does this happen, when before it is clearly stated that column reqid
is an object in both dataframes? Thanks.
Use pandas. Series. astype() twice to convert a DataFrame column from object to int.
By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.
Pandas Join vs Merge Differences The main difference between join vs merge would be; join() is used to combine two DataFrames on the index but not on columns whereas merge() is primarily used to specify the columns you wanted to join on, this also supports joining on indexes and combination of index and columns.
We can use the concat function in pandas to append either columns or rows from one DataFrame to another. Let's grab two subsets of our data to see how this works. When we concatenate DataFrames, we need to specify the axis. axis=0 tells pandas to stack the second DataFrame UNDER the first one.
The problem here is a misconception of how join works: when you say the_big_df.join(df_to_join,on='reqid')
it does not mean join on the_big_df.reqid == df_to_join.reqid
as one would assume at first glance, but rather join on the_big_df.reqid == df_to_join.index
. As requid
is of type object
and the index is of type int64
you get the error.
See the docs for join
:
Join columns with other DataFrame either on index or on a key column.
...
on : str, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.
Look at the following example:
df1 = pd.DataFrame({'id1': [1, 2], 'val1': [11,12]})
df2 = pd.DataFrame({'id2': [3, 4], 'val2': [21,22]})
print(df1)
# id1 val1
#0 1 11
#1 2 12
print(df2)
# id2 val2
#0 3 21
#1 4 22
# join on df1.id1 (int64) == df2.index (int64)
print(df1.join(df2, on='id1'))
# id1 val1 id2 val2
#0 1 11 4.0 22.0
#1 2 12 NaN NaN
# now df3 same as df1 but id3 as object:
df3 = pd.DataFrame({'id3': ['1', '2'], 'val1': [11,12]})
# try to join on df3.id3 (object) == df2.index (int64)
df3.join(df2, on='id3')
#ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
>>> df3.join(df2, on='id3')
id3 val1 id2 val2
0 1 11 NaN NaN
1 2 12 NaN NaN
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With