
pyspark pandas udf RuntimeError: Number of columns of the returned doesn't match specified schema

I have a pandas UDF defined below:

schema2 = StructType([
    StructField('sensorid', IntegerType(), True),
    StructField('confidence', DoubleType(), True)
])

@pandas_udf(schema2, PandasUDFType.GROUPED_MAP)
def PreProcess(Indf):
    confidence = 1
    sensor = Indf.iloc[0, 0]
    df = pd.DataFrame(columns=['sensorid', 'confidence'])
    df['sensorid'] = [sensor]
    df['confidence'] = [0]
    return df

I am then passing a Spark DataFrame with 3 columns to that UDF:

results.groupby("sensorid").apply(PreProcess)

results:
+--------+---------------+---------------+
|sensorid|sensortimestamp|calculatedvalue|
+--------+---------------+---------------+
|  397332|     1596518086|          -39.0|
|  397332|     1596525586|          -31.0|
+--------+---------------+---------------+

But I keep getting this error:

RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 3 Actual: 4

I can tell what the error is trying to say, but I don't understand how it can occur. I thought I was returning the correct 2 columns of the DataFrame, as specified in the struct.

tyringtocode asked Nov 16 '22 06:11

1 Answer

apply is deprecated, and it seems to expect the returned pandas.DataFrame to have the same columns as the input, in this case 3. Try applyInPandas with the expected output schema instead:

results.groupby("sensorid").applyInPandas(PreProcess, schema=schema2)
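The column-matching behavior that applyInPandas enforces can be sanity-checked without a Spark session: the function must return exactly the columns declared in schema2 for each group. A pandas-only sketch of that check (pre_process and results_pd are illustrative names, not part of the original code, and applyInPandas takes a plain function rather than a pandas_udf-decorated one):

```python
import pandas as pd

def pre_process(in_df: pd.DataFrame) -> pd.DataFrame:
    # Same per-group logic as the UDF body: emit one row per group
    sensor = in_df.iloc[0, 0]
    return pd.DataFrame({'sensorid': [sensor], 'confidence': [0.0]})

# Mimic the grouped input shown in the question
results_pd = pd.DataFrame({
    'sensorid': [397332, 397332],
    'sensortimestamp': [1596518086, 1596525586],
    'calculatedvalue': [-39.0, -31.0],
})

out = pd.concat(pre_process(g) for _, g in results_pd.groupby('sensorid'))
print(list(out.columns))  # → ['sensorid', 'confidence']
```

If the returned columns here don't match the fields of schema2 (both names and count), the grouped-map call will raise the same kind of schema mismatch in Spark.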

Updated links to the latest version (Spark's docs changed and the old links were broken).

In version 3.0.0: apply, applyInPandas

Daniel Argüelles answered Mar 29 '23 23:03