how to set spark.sql.shuffle.partitions when using the latest Spark version

I want to change the spark.sql.shuffle.partitions configuration in my PySpark code, since I need to join two big tables. But the following code does not work in the latest Spark version; the error says there is no method "setConf" on the object:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import pyspark
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)

# Neither of these works -- SparkContext has no setConf method:
spark.sparkContext.setConf("spark.sql.shuffle.partitions", "1000")
spark.sparkContext.setConf("spark.default.parallelism", "1000")

# Nor do these -- SparkSession has no setConf method either:
spark.setConf("spark.sql.shuffle.partitions", "1000")
spark.setConf("spark.default.parallelism", "1000")

I would like to know how to set "spark.sql.shuffle.partitions" in the current Spark version.

asked Dec 19 '22 by pingping chen


1 Answer

SparkSession provides a RuntimeConfig interface to set and get Spark-related parameters. The answer to your question would be:

spark.conf.set("spark.sql.shuffle.partitions", 1000)
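
You can read the current value back through the same RuntimeConfig interface, which is handy for confirming the change took effect:

spark.conf.get("spark.sql.shuffle.partitions")  # returns '1000'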

Refer: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.RuntimeConfig

I missed that your question was about PySpark. PySpark exposes the same interface as spark.conf. Refer: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.conf
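
For completeness, here is a minimal PySpark sketch putting it together. One assumption worth flagging: spark.sql.shuffle.partitions is a SQL runtime setting and can be changed on a live session, whereas spark.default.parallelism is a core Spark setting that generally has to be supplied before the context starts, e.g. through the builder:

from pyspark.sql import SparkSession

# Core settings such as spark.default.parallelism should be passed at
# session creation; changing them later has no effect (and recent Spark
# versions may raise an error if you try).
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.default.parallelism", "1000")
         .getOrCreate())

# SQL runtime settings can be changed on the live session:
spark.conf.set("spark.sql.shuffle.partitions", "1000")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # prints: 1000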

answered Apr 25 '23 by Sai Kiriti Badam