I am trying to use spark.read to get file count inside my UDF, but when i execute the program hangs at that point.
i am calling an UDF in withcolumn of dataframe. the udf has to read a file and return a count of it. But it is not working. i am passing a variable value to UDF function. when i remove the spark.read code and simply return a number it works. but spark.read is not working through UDF
def prepareRowCountfromParquet(jobmaster_pa: String)(implicit spark: SparkSession): Int = {
print("The variable value is " + jobmaster_pa)
print("the count is " + spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt)
spark.read.format("csv").option("header", "true").load(jobmaster_pa).count().toInt
}
val SRCROWCNT = udf(prepareRowCountfromParquet _)
df
.withColumn("SRC_COUNT", SRCROWCNT(lit(keyPrefix)))
SRC_COUNT column should get lines of the file
UDFs cannot use spark context as it exists only in the driver and it isn't serializable.
generally speaking, you need to read all the csvs, calc the counts using a groupBy and then you can do a left join to the df
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With