Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trouble with df.join(): ValueError: You are trying to merge on object and int64 columns

None of these questions adress the issue: Question 1 and Question 2 nor could I find the answer in pandas documentation.

Hello, I am trying to find the underlying cause for this error:

ValueError: You are trying to merge on object and int64 columns.

I know I can work around this problem using pandas concat or merge function, but I am trying to understand the cause for the error. The question is: Why do I get this ValueError?

Here's the output of the head(5) and info() on both dataframes that are used.

print(the_big_df.head(5)) Output:

  account  apt  apt_p  balance       date  day    flag  month  reps     reqid  year
0  AA0420    0    0.0  -578.30 2019-03-01    1       1      3    10  82f2d761  2019
1  AA0420    0    0.1  -578.30 2019-03-02    2       1      3    10  82f2d761  2019
2  AA0420    0    0.1  -578.30 2019-03-03    3       1      3    10  82f2d761  2019
3  AA0421    0    0.1  -607.30 2019-03-04    4       1      3    10  82f2d761  2019
4  AA0421    0    0.1  -610.21 2019-03-05    5       1      3    10  82f2d761  2019

print(the_big_df.info()) Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36054 entries, 0 to 36053
Data columns (total 11 columns):
account        36054 non-null object
apt            36054 non-null int64
apt_p          36054 non-null float64
balance        36054 non-null float64
date           36054 non-null datetime64[ns]
day            36054 non-null int64
flag           36054 non-null int64
month          36054 non-null int64
reps           36054 non-null int32
reqid          36054 non-null object
year           36054 non-null int64
dtypes: datetime64[ns](1), float64(2), int32(1), int64(5), object(2)
memory usage: 3.2+ MB

Here's the dataframe I'm passing to the join(); print(df_to_join.head(5)):

      reqid     id
0  54580f39  13301
1  3ba905c0  77114
2  5f2d80da  13302
3  a1478e98  77115
4  9b09854b  78598

print(df_to_join.info()) Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14332 entries, 0 to 14331
Data columns (total 2 columns):
reqid    14332 non-null object
dni      14332 non-null object

The exact next line after the 4 prints stated above is:

the_max_df = the_big_df.join(df_to_join,on='reqid')

And the output is, as stated above:

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

Why does this happen, when before it is clearly stated that column reqid is an object in both dataframes? Thanks.

like image 718
Celius Stingher Avatar asked Sep 04 '19 20:09

Celius Stingher


People also ask

How do you change the datatype of an object to int64 in Python?

Use pandas. Series. astype() twice to convert a DataFrame column from object to int.

How do I combine two columns in DF?

By use + operator simply you can combine/merge two or multiple text/string columns in pandas DataFrame. Note that when you apply + operator on numeric columns it actually does addition instead of concatenation.

Is join and merge same in pandas?

Pandas Join vs Merge Differences The main difference between join vs merge would be; join() is used to combine two DataFrames on the index but not on columns whereas merge() is primarily used to specify the columns you wanted to join on, this also supports joining on indexes and combination of index and columns.

Which command is used for combining merging DataFrames in pandas?

We can use the concat function in pandas to append either columns or rows from one DataFrame to another. Let's grab two subsets of our data to see how this works. When we concatenate DataFrames, we need to specify the axis. axis=0 tells pandas to stack the second DataFrame UNDER the first one.


1 Answers

The problem here is a misconception of how join works: when you say the_big_df.join(df_to_join,on='reqid') it does not mean join on the_big_df.reqid == df_to_join.reqid as one would assume at first glance, but rather join on the_big_df.reqid == df_to_join.index. As requid is of type object and the index is of type int64 you get the error.

See the docs for join:

Join columns with other DataFrame either on index or on a key column.
...
on : str, list of str, or array-like, optional
Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

Look at the following example:

df1 = pd.DataFrame({'id1': [1, 2], 'val1': [11,12]})
df2 = pd.DataFrame({'id2': [3, 4], 'val2': [21,22]})
print(df1)
#   id1  val1
#0    1    11
#1    2    12
print(df2)
#   id2  val2
#0    3    21
#1    4    22

# join on df1.id1 (int64) == df2.index (int64) 
print(df1.join(df2, on='id1'))
#   id1  val1  id2  val2
#0    1    11  4.0  22.0
#1    2    12  NaN   NaN

# now df3 same as df1 but id3 as object:
df3 = pd.DataFrame({'id3': ['1', '2'], 'val1': [11,12]})

# try to join on df3.id3 (object) == df2.index (int64) 
df3.join(df2, on='id3')
#ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat


Please note: the above is for a modern versions of pandas. Version 20.3 gave the following result:
>>> df3.join(df2, on='id3')
  id3  val1  id2  val2
0   1    11  NaN   NaN
1   2    12  NaN   NaN
like image 184
Stef Avatar answered Oct 22 '22 20:10

Stef