scala> val p = sc.textFile("file:///c:/_home/so-posts.xml", 8) // I have 8 cores
p: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:21
scala> p.partitions.size
res33: Int = 729
I was expecting 8 to be printed, but I see 729 tasks in the Spark UI.
EDIT:
After calling repartition()
as suggested by @zero323
scala> val p1 = p.repartition(8)
scala> p1.partitions.size
res60: Int = 8
scala> p1.count
I still see 729 tasks in the Spark UI even though the spark-shell prints 8.
Hash partitioning in Spark attempts to spread the data evenly across partitions based on the key. The Object.hashCode method is used to determine the partition, as partition = key.hashCode() % numPartitions.
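As a quick illustration (a minimal sketch, not from the original post), HashPartitioner applies exactly that rule to a pair RDD:
import org.apache.spark.HashPartitioner

// Each key is assigned to partition key.hashCode() % numPartitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val byHash = pairs.partitionBy(new HashPartitioner(8))
byHash.partitions.size  // 8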
Spark will run one task for each partition of the cluster.
The coalesce() and repartition() transformations are both used to change the number of partitions of an RDD. The main difference is that if you are increasing the number of partitions you should use repartition(), which performs a full shuffle.
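For example (a hedged sketch using the RDD p loaded above; the partition counts are illustrative):
val grown  = p.repartition(16)   // full shuffle, 16 partitions
val shrunk = p.coalesce(4)       // avoids a full shuffle, 4 partitions
grown.partitions.size            // 16
shrunk.partitions.size           // 4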
By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value.
If you take a look at the signature
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]
you'll see that the argument you use is called minPartitions
and this pretty much describes its function. In some cases even that is ignored, but that is a different matter. The input format used behind the scenes still decides how to compute splits.
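To see that it is only a lower bound, compare a small and a large value (a hedged sketch; exact counts depend on the file and the block size):
val few  = sc.textFile("file:///c:/_home/so-posts.xml", 2)
few.partitions.size   // still driven by the computed split size, so likely 729 again
val many = sc.textFile("file:///c:/_home/so-posts.xml", 2000)
many.partitions.size  // at least roughly 2000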
In this particular case you could probably use mapred.min.split.size to increase the split size (this will work during load) or simply repartition after loading (this will take effect after the data is loaded), but in general there should be no need for that.
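Both options in one hedged sketch (the 256 MB split size is just an illustrative value, and the exact partition counts depend on the file):
// Option 1, at load time: raise the minimum split size so TextInputFormat
// computes fewer, larger splits.
sc.hadoopConfiguration.set("mapred.min.split.size", (256L * 1024 * 1024).toString)
val coarse = sc.textFile("file:///c:/_home/so-posts.xml", 8)
coarse.partitions.size  // fewer partitions than 729

// Option 2, after loading: shuffle into exactly 8 partitions.
val p1 = p.repartition(8)
p1.partitions.size      // 8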
@zero323 nailed it, but I thought I'd add a bit more (low-level) background on how this minPartitions
input parameter influences the number of partitions.
tl;dr The partition parameter does have an effect on SparkContext.textFile
as the minimum (not the exact!) number of partitions.
In this particular case of using SparkContext.textFile, the number of partitions is calculated directly by org.apache.hadoop.mapred.TextInputFormat.getSplits(jobConf, minPartitions), which is used by textFile. TextInputFormat only knows how to partition (aka split) the distributed data, with Spark merely following its advice.
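For reference, textFile is roughly a hadoopFile call over the old-API TextInputFormat, keeping only the values (a simplified sketch paraphrasing Spark's sources):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// The split computation is delegated to TextInputFormat.getSplits;
// the 8 passed here only acts as the minimum number of splits.
val viaHadoopFile = sc
  .hadoopFile("file:///c:/_home/so-posts.xml", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], 8)
  .map(_._2.toString)
viaHadoopFile.partitions.size  // same split logic as textFile, so 729 again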
From Hadoop's FileInputFormat's javadoc:
FileInputFormat is the base class for all file-based InputFormats. This provides a generic implementation of getSplits(JobConf, int). Subclasses of FileInputFormat can also override the isSplitable(FileSystem, Path) method to ensure input-files are not split-up and are processed as a whole by Mappers.
It is a very good example of how Spark leverages the Hadoop API.
BTW, you may find the sources enlightening ;-)