
Pandas Join on String Datatype




I am trying to join two pandas DataFrames on an id field that holds string UUIDs. I get a ValueError:

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

The code is below. Following the question "Trying to merge 2 dataframes but get ValueError", I tried converting both id fields to strings, but the error remains. Note that pdf comes from a Spark DataFrame's toPandas(), while outputsPdf is created from a dictionary.

pdf.id = pdf.id.apply(str)
outputsPdf.id = outputsPdf.id.apply(str)
inOutPdf = pdf.join(outputsPdf, on='id', how='left', rsuffix='fs')

The dtypes of the two frames (pdf first, then outputsPdf):

id         object
time      float64
height    float32
dtype: object

id         object
labels    float64
dtype: object

How can I debug this? Full Traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-13-deb429dde9ad> in <module>()
     61 pdf['id'] = pdf['id'].astype(str)
     62 outputsPdf['id'] = outputsPdf['id'].astype(str)
---> 63 inOutPdf = pdf.join(outputsPdf, on=['id'], how='left', rsuffix='fs')
     65 # idSparkDf = spark.createDataFrame(idPandasDf, schema=StructType([StructField('id', StringType(), True),

~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

~/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    553         # validate the merge keys dtypes. We may need to coerce
    554         # to avoid incompat dtypes
--> 555         self._maybe_coerce_merge_keys()
    557         # If argument passed to validate,

~/miniconda3/lib/python3.6/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    984             elif (not is_numeric_dtype(lk)
    985                     and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):
--> 986                 raise ValueError(msg)
    987             elif is_datetimelike(lk) and not is_datetimelike(rk):
    988                 raise ValueError(msg)
Paul Bendevis asked Sep 17 '18 17:09


1 Answer

The on parameter only applies to the calling DataFrame!

on: Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index.

Although you specify on='id', join only looks up 'id' in pdf, where it has object dtype, and tries to match it against the index of outputsPdf, which holds integers. That mismatch is what raises the error.

To join on non-index columns across two DataFrames, either set those columns as the index on both sides, or use merge, since the on parameter in pd.merge applies to both DataFrames.


import pandas as pd

df1 = pd.DataFrame({'id': ['1', 'True', '4'], 'vals': [10, 11, 12]})
df2 = df1.copy()

df1.join(df2, on='id', how='left', rsuffix='_fs')

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

On the other hand, these work:

df1.set_index('id').join(df2.set_index('id'), how='left', rsuffix='_fs').reset_index()
#     id  vals  vals_fs
#0     1    10       10
#1  True    11       11
#2     4    12       12

df1.merge(df2, on='id', how='left', suffixes=['', '_fs'])
#     id  vals  vals_fs
#0     1    10       10
#1  True    11       11
#2     4    12       12
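
To debug a case like this, it may help to compare the dtype of the key column in the caller with the dtype of the other frame's index, since those are the two things join(on=...) actually matches. The following is a minimal sketch under assumptions: left and right are hypothetical stand-ins for pdf and outputsPdf, and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the question's pdf and outputsPdf.
left = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'time': [0.1, 0.2, 0.3]})
right = pd.DataFrame({'id': ['a1', 'c3'], 'labels': [1.0, 0.0]})

# join(on='id') matches left['id'] against right.index, so inspect those two.
left_key_dtype = left['id'].dtype      # a string-like (non-numeric) dtype
right_index_dtype = right.index.dtype  # integer dtype of the default index
print(left_key_dtype, right_index_dtype)

# merge(on='id') matches the 'id' columns of both frames instead.
merged = left.merge(right, on='id', how='left', suffixes=('', '_fs'))
print(merged)
```

If the two printed dtypes differ in kind (strings on one side, integers on the other), the join is being attempted against the index rather than the intended column.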
ALollz answered Sep 22 '22 17:09
