
spark.read doesn't work inside a Scala UDF

I am trying to use spark.read to get a file's row count inside my UDF, but when I execute the program, it hangs at that point.

I am calling a UDF in withColumn on a DataFrame. The UDF has to read a file and return its row count, but it is not working. I am passing a variable value to the UDF. When I remove the spark.read code and simply return a number, it works, but spark.read does not work inside the UDF.

def prepareRowCountfromParquet(jobmaster_pa: String)(implicit spark: SparkSession): Int = {
  print("The variable value is " + jobmaster_pa)
  print("the count is " + spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt)
  spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt
}

val SRCROWCNT = udf(prepareRowCountfromParquet _)

df
  .withColumn("SRC_COUNT", SRCROWCNT(lit(keyPrefix)))

The SRC_COUNT column should contain the number of lines in the file.

senthilnathan asked Jun 07 '26 06:06



1 Answer

UDFs cannot use the Spark context (or the SparkSession built on it): it exists only on the driver and isn't serializable, so it is not available to UDF code, which runs on the executors.
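In the question, the UDF is called with a single constant path (lit(keyPrefix)), so one possible workaround is to run the count on the driver as an ordinary Spark job and attach the result as a literal column. The following is only a sketch under that assumption. The object name DriverSideCount and the helpers csvRowCount and withConstantSrcCount are hypothetical names introduced here for illustration, and the count is kept as a Long rather than converted with toInt as in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical helper object, not part of the original question or answer.
object DriverSideCount {

  // Runs on the driver: spark.read is used here, outside of any UDF.
  def csvRowCount(path: String)(implicit spark: SparkSession): Long =
    spark.read
      .format("csv")
      .option("header", "true") // the header line is not counted as a row
      .load(path)
      .count()

  // Assumes the same path applies to every row of df, as with lit(keyPrefix).
  def withConstantSrcCount(df: DataFrame, path: String)(
      implicit spark: SparkSession): DataFrame = {
    val n = csvRowCount(path)          // computed once, before the column is added
    df.withColumn("SRC_COUNT", lit(n)) // plain literal column, no UDF involved
  }
}
```

With an implicit SparkSession in scope, this would be called as DriverSideCount.withConstantSrcCount(df, keyPrefix). It does not help when each row refers to a different file.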

Generally speaking, you need to read all the CSVs, calculate the counts using a groupBy, and then left join the result to the df.
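The read, groupBy, and left join steps could be sketched roughly as follows. This is an illustration rather than the answerer's own code. It assumes the files can all be loaded through one path or glob (csvGlob), that df has a string column holding each file's path (pathCol), and that those paths are written in the same fully qualified form that input_file_name() reports (for example with a file: or hdfs: scheme). The object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, input_file_name, lit}

// Hypothetical helper object, not part of the original question or answer.
object SrcCountJoin {

  // Reads every CSV matched by csvGlob in one distributed job and
  // counts the data rows per source file.
  def rowCountsPerFile(csvGlob: String)(implicit spark: SparkSession): DataFrame =
    spark.read
      .format("csv")
      .option("header", "true")
      .load(csvGlob)
      .withColumn("SRC_PATH", input_file_name()) // path of the file each row came from
      .groupBy("SRC_PATH")
      .agg(count(lit(1)).cast("int").as("SRC_COUNT"))

  // Left join: rows of df whose path has no matching file get a null SRC_COUNT.
  // Assumes df(pathCol) uses the same path format as input_file_name().
  def withSrcCount(df: DataFrame, counts: DataFrame, pathCol: String): DataFrame =
    df.join(counts, df(pathCol) === counts("SRC_PATH"), "left")
      .drop(counts("SRC_PATH"))
}
```

Because the counting runs as a normal DataFrame job, no SparkSession has to be captured inside a UDF.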

Arnon Rotem-Gal-Oz answered Jun 10 '26 06:06



