
SparkContext should only be created and accessed on the driver

I am using Azure Databricks (10.4 LTS, which includes Apache Spark 3.2.1 and Scala 2.12) on Standard_L8s nodes.

When executing the code below, I get a "SparkContext should only be created and accessed on the driver" error. If I use plain import pandas instead, it runs fine, but it takes more than 3 hours, and I have billions of records to process. I need to tune this UDF; please help.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import pyspark.pandas as pd

def getnearest_five_min_slot(value):
  # Find the smallest 5-minute slot boundary that is >= value
  dataframe = pd.DataFrame([300, 600, 900, 1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600], columns=['value'])
  rslt_df = dataframe.loc[dataframe['value'] >= value]
  rslt_df = rslt_df.sort_values(by=['value'], ascending=[True]).head(1)
  output = int(rslt_df.iat[0, 0])
  print('\nResult dataframe :\n', output)

  return output

getnearestFiveMinSlot = udf(lambda m: getnearest_five_min_slot(m))

slotValue = [100, 500, 1100, 400, 601]
df = spark.createDataFrame(slotValue, IntegerType())
df = df.withColumn("NewValue", getnearestFiveMinSlot("value"))
display(df)
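
As a side note, this rounding-up-to-the-next-300-second slot can usually be expressed with built-in column functions instead of a Python UDF, which avoids importing pyspark.pandas on the executors entirely (pyspark.pandas tries to access the SparkContext, which is only available on the driver). A minimal sketch, assuming values are positive and you still want the 3600 cap implied by the lookup list above:

from pyspark.sql.functions import ceil, col, least, lit
from pyspark.sql.types import IntegerType

slotValue = [100, 500, 1100, 400, 601]
df = spark.createDataFrame(slotValue, IntegerType())

# Round each value up to the next multiple of 300 seconds (5 minutes),
# capped at 3600 to match the original lookup list.
df = df.withColumn("NewValue", least(ceil(col("value") / lit(300)) * 300, lit(3600)))
display(df)

Because this stays in the JVM as native column expressions, it should scale to billions of rows far better than a row-at-a-time Python UDF.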
asked Dec 22 '25 by Aravind Peddola

1 Answer

I have added SparkSession to my script, and the error continues. What is weird in my case is that when I run the code in a Databricks notebook it runs just fine, but when I try to run it as a .py script it raises this error.
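
For reference, this is roughly what "added SparkSession" means in a standalone script; a minimal sketch, assuming the script is submitted to the cluster (the appName is just a placeholder):

from pyspark.sql import SparkSession

# Build (or reuse) the session on the driver. Even with this in place,
# any code inside a UDF still runs on the executors, so the UDF body
# must not touch SparkContext or import pyspark.pandas.
spark = SparkSession.builder.appName("nearest-slot").getOrCreate()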

answered Dec 24 '25 by Michel Arruda


