I am trying to join two pandas DataFrames on an id field, which is a string UUID. I get a ValueError:
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
The code is below. I tried converting the fields to strings, as suggested in "Trying to merge 2 dataframes but get ValueError", but the error remains. Note that pdf comes from a Spark DataFrame via toPandas(), while outputsPdf is created from a dictionary.
pdf.id = pdf.id.apply(str)
outputsPdf.id = outputsPdf.id.apply(str)
inOutPdf = pdf.join(outputsPdf, on='id', how='left', rsuffix='fs')
pdf.dtypes
id object
time float64
height float32
dtype: object
outputsPdf.dtypes
id object
labels float64
dtype: object
How can I debug this? Full Traceback:
ValueError Traceback (most recent call last)
<ipython-input-13-deb429dde9ad> in <module>()
61 pdf['id'] = pdf['id'].astype(str)
62 outputsPdf['id'] = outputsPdf['id'].astype(str)
---> 63 inOutPdf = pdf.join(outputsPdf, on=['id'], how='left', rsuffix='fs')
64
65 # idSparkDf = spark.createDataFrame(idPandasDf, schema=StructType([StructField('id', StringType(), True),
~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
6334 # For SparseDataFrame's benefit
6335 return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336 rsuffix=rsuffix, sort=sort)
6337
6338 def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
6349 return merge(self, other, left_on=on, how=how,
6350 left_index=on is None, right_index=True,
-> 6351 suffixes=(lsuffix, rsuffix), sort=sort)
6352 else:
6353 if on is not None:
~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
59 right_index=right_index, sort=sort, suffixes=suffixes,
60 copy=copy, indicator=indicator,
---> 61 validate=validate)
62 return op.get_result()
63
~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
553 # validate the merge keys dtypes. We may need to coerce
554 # to avoid incompat dtypes
--> 555 self._maybe_coerce_merge_keys()
556
557 # If argument passed to validate,
~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
984 elif (not is_numeric_dtype(lk)
985 and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):
--> 986 raise ValueError(msg)
987 elif is_datetimelike(lk) and not is_datetimelike(rk):
988 raise ValueError(msg)
Pandas uses the object dtype for storing strings.
The on parameter only applies to the calling DataFrame! From the documentation:
on: Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.
Though you specify on='id', join will use the 'id' column in pdf, which has object dtype, and attempt to join it with the index of outputsPdf, which takes integer values.
If you need to join on non-index columns across two DataFrames, you can either set them as the index, or you must use merge, as the on parameter in pd.merge applies to both DataFrames.
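One way to confirm this diagnosis is to compare the dtype of the key column in the caller with the dtype of the index of the other frame, since those are the two things join actually matches. The sketch below uses small hypothetical frames named left and right as stand-ins for pdf and outputsPdf; the values are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-ins for pdf and outputsPdf from the question.
left = pd.DataFrame({"id": ["a1", "b2", "c3"], "time": [0.1, 0.2, 0.3]})
right = pd.DataFrame({"id": ["a1", "c3"], "labels": [1.0, 0.0]})

# left.join(right, on="id") matches left["id"] against right.index,
# not against right["id"], so these are the two dtypes to compare.
left_key_is_numeric = is_numeric_dtype(left["id"])
right_index_is_numeric = is_numeric_dtype(right.index)

print(left["id"].dtype, right.index.dtype)
print(left_key_is_numeric, right_index_is_numeric)
```

If the key column is non-numeric while the other frame's index is numeric, as with a default integer index, the mismatch reported in the error message is the likely cause.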
import pandas as pd
df1 = pd.DataFrame({'id': ['1', 'True', '4'], 'vals': [10, 11, 12]})
df2 = df1.copy()
df1.join(df2, on='id', how='left', rsuffix='_fs')
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
On the other hand, these work:
df1.set_index('id').join(df2.set_index('id'), how='left', rsuffix='_fs').reset_index()
# id vals vals_fs
#0 1 10 10
#1 True 11 11
#2 4 12 12
df1.merge(df2, on='id', how='left', suffixes=['', '_fs'])
# id vals vals_fs
#0 1 10 10
#1 True 11 11
#2 4 12 12
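Applied to frames shaped like the ones in the question, the two fixes might look like the sketch below. The data here is hypothetical (the real pdf comes from Spark), and the column names are taken from the dtypes listing in the question.

```python
import pandas as pd

# Hypothetical stand-ins for pdf and outputsPdf from the question.
pdf = pd.DataFrame({"id": ["a1", "b2", "c3"], "time": [0.1, 0.2, 0.3]})
outputsPdf = pd.DataFrame({"id": ["a1", "c3"], "labels": [1.0, 0.0]})

# Option 1: merge matches the 'id' column on both sides.
merged = pdf.merge(outputsPdf, on="id", how="left")

# Option 2: keep join, but move 'id' into the index of the other frame
# so that the caller's 'id' column is matched against it.
joined = pdf.join(outputsPdf.set_index("id"), on="id", how="left")

print(merged)
print(joined)
```

Both calls keep every row of pdf and leave labels missing for ids that do not appear in outputsPdf.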