The Scala version of SparkContext has the property
sc.hadoopConfiguration
I have successfully used that to set Hadoop properties (in Scala)
e.g.
sc.hadoopConfiguration.set("my.mapreduce.setting","someVal")
However, the Python version of SparkContext lacks that accessor. Is there any way to set Hadoop configuration values into the Hadoop Configuration used by the PySpark context?
PySpark has a JVM running behind the scenes to actually run the Spark code. sc._jvm is the gateway into that JVM, and sc._jsc is the Java Spark Context, a proxy to the SparkContext running in that JVM.
A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster. Only one SparkContext should be active per JVM. You must stop() the active SparkContext before creating a new one.
PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for large-scale data processing. If you're already familiar with Python and libraries such as Pandas, PySpark is a good way to build more scalable analyses and pipelines.
sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')
should work
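For example, a minimal sketch (the property name and value are just the placeholders from the question) that sets a value through the proxied Configuration and reads it back to confirm it reached the JVM:
# sc._jsc.hadoopConfiguration() returns a py4j proxy to the JVM-side
# org.apache.hadoop.conf.Configuration, so set()/get() run inside the JVM.
sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')
print(sc._jsc.hadoopConfiguration().get('my.mapreduce.setting'))  # prints 'someVal'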
You can set any Hadoop property using the --conf parameter when submitting the job; prefix the Hadoop key with spark.hadoop., e.g.
--conf "spark.hadoop.fs.mapr.trace=debug"
Source: https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L105
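The spark.hadoop.* prefix also works when setting the configuration in code rather than on the command line; a rough sketch, assuming a SparkSession-based entry point (the app name and property are illustrative):
from pyspark.sql import SparkSession

# Keys prefixed with "spark.hadoop." are copied into the Hadoop Configuration
# (with the prefix stripped), per the SparkHadoopUtil source linked above.
spark = SparkSession.builder \
    .appName("hadoop-conf-example") \
    .config("spark.hadoop.my.mapreduce.setting", "someVal") \
    .getOrCreate()
sc = spark.sparkContext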
I looked into the PySpark source code (context.py) and there is no direct equivalent. Instead, some specific methods support passing in a map of (key, value) pairs:
fileLines = sc.newAPIHadoopFile(
    'dev/*',                                                   # input path(s)
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',   # input format class
    'org.apache.hadoop.io.LongWritable',                       # key class
    'org.apache.hadoop.io.Text',                               # value class
    conf={'mapreduce.input.fileinputformat.input.dir.recursive': 'true'}
).count()
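Alternatively, since sc._jsc.hadoopConfiguration() (from the answer above) feeds the same Configuration that the built-in readers use, setting the key there before reading should have the same effect; a sketch, assuming the recursive-directory setting is honored by the input format that textFile uses:
# Set the property on the shared JVM-side Configuration, then read as usual.
sc._jsc.hadoopConfiguration().set(
    'mapreduce.input.fileinputformat.input.dir.recursive', 'true')
fileLines = sc.textFile('dev/*').count()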