Pandas Join on String Datatype

Tags:

python

pandas

I am trying to join two pandas dataframes on an id field which is a string uuid. I get a ValueError:

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

The code is below. I tried converting the id fields to strings, as suggested in "Trying to merge 2 dataframes but get ValueError", but the error remains. Note that pdf comes from a Spark dataframe.toPandas() call, while outputsPdf is created from a dictionary.

pdf.id = pdf.id.apply(str)
outputsPdf.id = outputsPdf.id.apply(str)
inOutPdf = pdf.join(outputsPdf, on='id', how='left', rsuffix='fs')

pdf.dtypes
id         object
time      float64
height    float32
dtype: object

outputsPdf.dtypes
id         object
labels    float64
dtype: object

How can I debug this? Full Traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-13-deb429dde9ad> in <module>()
     61 pdf['id'] = pdf['id'].astype(str)
     62 outputsPdf['id'] = outputsPdf['id'].astype(str)
---> 63 inOutPdf = pdf.join(outputsPdf, on=['id'], how='left', rsuffix='fs')
     64 
     65 # idSparkDf = spark.createDataFrame(idPandasDf, schema=StructType([StructField('id', StringType(), True),

~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6337 
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()
     63 

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    553         # validate the merge keys dtypes. We may need to coerce
    554         # to avoid incompat dtypes
--> 555         self._maybe_coerce_merge_keys()
    556 
    557         # If argument passed to validate,

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    984             elif (not is_numeric_dtype(lk)
    985                     and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):
--> 986                 raise ValueError(msg)
    987             elif is_datetimelike(lk) and not is_datetimelike(rk):
    988                 raise ValueError(msg)


asked Sep 17 '18 17:09

Paul Bendevis

1 Answer

The on parameter only applies to the calling DataFrame!

on: Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

Although you specify on='id', join takes the 'id' column from pdf, which has object dtype, and tries to match it against the index of outputsPdf, which holds integer values. That is why both 'id' columns show as object in dtypes while the error still mentions int64.

If you need to join on non-index columns of both DataFrames, either set those columns as the index or use merge, because the on parameter of pd.merge applies to both DataFrames.
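To answer the "how can I debug this" part, one option is to look at the dtypes that join actually compares: the caller's column and the other frame's index. The snippet below is only a sketch with made-up data; the frames `left` and `right` and the helper `describe_join_keys` are hypothetical names, not part of pandas or of the original code.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-ins for pdf and outputsPdf (values are invented).
left = pd.DataFrame({"id": ["a1", "b2", "c3"], "time": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"id": ["a1", "b2", "c3"], "labels": [1.0, 0.0, 1.0]})


def describe_join_keys(caller, other, on):
    """Return the dtypes that caller.join(other, on=on) would compare."""
    return {
        "caller_column": caller[on].dtype,  # the column named by `on`
        "other_index": other.index.dtype,   # join always uses other's index
    }


keys = describe_join_keys(left, right, "id")
print(keys)

# A string key on one side and a numeric index on the other is the
# combination that triggers the ValueError from the question.
mismatch = (not is_numeric_dtype(keys["caller_column"])
            and is_numeric_dtype(keys["other_index"]))
print("dtype mismatch:", mismatch)
```

Calling `.dtypes` on the frames never shows the index dtype, so checking `other.index.dtype` separately is what reveals the problem.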

Example

import pandas as pd

df1 = pd.DataFrame({'id': ['1', 'True', '4'], 'vals': [10, 11, 12]})
df2 = df1.copy()

df1.join(df2, on='id', how='left', rsuffix='_fs')

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

On the other hand, these work:

df1.set_index('id').join(df2.set_index('id'), how='left', rsuffix='_fs').reset_index()
#     id  vals  vals_fs
#0     1    10       10
#1  True    11       11
#2     4    12       12

df1.merge(df2, on='id', how='left', suffixes=['', '_fs'])
#     id  vals  vals_fs
#0     1    10       10
#1  True    11       11
#2     4    12       12
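Applied to the frames from the question, a third variant is to keep `on='id'` for the caller and move only the other frame's 'id' into its index. This is a sketch under assumptions: the data below is invented, and only the column names follow the dtypes listing in the question.

```python
import pandas as pd

# Invented data; only the column names follow the question.
pdf = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "time": [0.1, 0.2, 0.3],
    "height": [1.0, 2.0, 3.0],
})
outputsPdf = pd.DataFrame({"id": ["a1", "c3"], "labels": [1.0, 0.0]})

# join matches pdf['id'] against the index of the other frame,
# so give the other frame an index built from its own 'id' column.
joined = pdf.join(outputsPdf.set_index("id"), on="id", how="left", rsuffix="fs")

# merge with on='id' uses the 'id' column of both frames directly.
merged = pdf.merge(outputsPdf, on="id", how="left", suffixes=("", "fs"))

print(joined)
print(merged)
```

Both calls keep every row of pdf and leave labels as NaN where outputsPdf has no matching id. The suffix is only used if the two frames share a non-key column name.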

answered Sep 22 '22 17:09

ALollz
