
SparkContext should only be created and accessed on the driver

I am using Azure Databricks (10.4 LTS, which includes Apache Spark 3.2.1 and Scala 2.12) on Standard_L8s nodes.

When executing the code below, I get a "SparkContext should only be created and accessed on the driver" error. If I use plain import pandas instead, it runs fine, but it takes more than 3 hours, and I have billions of records to process. I need to tune this UDF; please help.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
import pyspark.pandas as pd

def getnearest_five_min_slot(value):
  # Find the smallest 5-minute slot boundary that is >= value
  dataframe = pd.DataFrame([300, 600, 900, 1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600], columns=['value'])
  rslt_df = dataframe.loc[dataframe['value'] >= value]
  rslt_df = rslt_df.sort_values(by=['value'], ascending=[True]).head(1)
  output = int(rslt_df.iat[0, 0])
  print('\nResult dataframe :\n', output)

  return output

getnearestFiveMinSlot = udf(lambda m: getnearest_five_min_slot(m))

slotValue = [100, 500, 1100, 400, 601]
df = spark.createDataFrame(slotValue, IntegerType())
df = df.withColumn("NewValue", getnearestFiveMinSlot("value"))
display(df)
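
As a side note, this rounding-up-to-the-next-300-second slot can usually be expressed with built-in column functions instead of a Python UDF, which avoids importing pyspark.pandas on the executors entirely (pyspark.pandas tries to access the SparkContext, which is only available on the driver). A minimal sketch, assuming values are positive and you still want the 3600 cap implied by the lookup list above:

from pyspark.sql.functions import ceil, col, least, lit
from pyspark.sql.types import IntegerType

slotValue = [100, 500, 1100, 400, 601]
df = spark.createDataFrame(slotValue, IntegerType())

# Round each value up to the next multiple of 300 seconds (5 minutes),
# capped at 3600 to match the original lookup list.
df = df.withColumn("NewValue", least(ceil(col("value") / lit(300)) * 300, lit(3600)))
display(df)

Because this stays in the JVM as native column expressions, it should scale to billions of rows far better than a row-at-a-time Python UDF.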
asked Dec 22 '25 by Aravind Peddola

1 Answer

I have added SparkSession to my script, and the error continues. What is weird in my case is that when I run the code in a Databricks notebook it runs just fine, but when I try to run it as a .py script it raises this error.
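
For reference, this is roughly what "added SparkSession" means in a standalone script; a minimal sketch, assuming the script is submitted to the cluster (the appName is just a placeholder):

from pyspark.sql import SparkSession

# Build (or reuse) the session on the driver. Even with this in place,
# any code inside a UDF still runs on the executors, so the UDF body
# must not touch SparkContext or import pyspark.pandas.
spark = SparkSession.builder.appName("nearest-slot").getOrCreate()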

answered Dec 24 '25 by Michel Arruda


