I have a pandas UDF defined below:
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.sql.functions import pandas_udf, PandasUDFType
import pandas as pd

schema2 = StructType([
    StructField('sensorid', IntegerType(), True),
    StructField('confidence', DoubleType(), True)
])

@pandas_udf(schema2, PandasUDFType.GROUPED_MAP)
def PreProcess(Indf):
    confidence = 1
    sensor = Indf.iloc[0, 0]
    df = pd.DataFrame(columns=['sensorid', 'confidence'])
    df['sensorid'] = [sensor]
    df['confidence'] = [0]
    return df
I am then passing a Spark DataFrame with 3 columns into that UDF:

results.groupby("sensorid").apply(PreProcess)
results:
+--------+---------------+---------------+
|sensorid|sensortimestamp|calculatedvalue|
+--------+---------------+---------------+
| 397332| 1596518086| -39.0|
| 397332| 1596525586| -31.0|
But I keep getting this error:
RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 3 Actual: 4
I can tell what the error is trying to say, but I don't understand how it can occur. I thought I was returning the correct 2 columns specified in the struct.
apply is deprecated, and it seems to expect the returned DataFrame to have the same columns as the input, in this case 3. Try applyInPandas with the expected output schema:
results.groupby("sensorid").applyInPandas(PreProcess, schema=schema2)
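As a quick sanity check outside Spark, you can run the function body on a plain pandas DataFrame for a single group and confirm it returns exactly the two columns declared in schema2 (a minimal sketch; the sample values are copied from the question's results table):

```python
import pandas as pd

# Plain-pandas version of PreProcess, for testing without a Spark session
def pre_process(indf):
    sensor = indf.iloc[0, 0]
    df = pd.DataFrame(columns=['sensorid', 'confidence'])
    df['sensorid'] = [sensor]
    df['confidence'] = [0]
    return df

# One group of the input, mirroring the 3-column results DataFrame
group = pd.DataFrame({
    'sensorid': [397332, 397332],
    'sensortimestamp': [1596518086, 1596525586],
    'calculatedvalue': [-39.0, -31.0],
})

out = pre_process(group)
print(list(out.columns))  # ['sensorid', 'confidence'] -- matches schema2
```

Since the returned frame has 2 columns, passing schema=schema2 to applyInPandas makes the declared output schema match what the function actually produces.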
Updated the links to the latest version (Spark's docs changed and the old links were broken). In version 3.0.0: apply, applyInPandas.