The "old" SparkContext.hadoopFile
takes a minPartitions
argument, which is a hint for the number of partitions:
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions
): RDD[(K, V)]
But there is no such argument on SparkContext.newAPIHadoopFile:
def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
    path: String,
    fClass: Class[F],
    kClass: Class[K],
    vClass: Class[V],
    conf: Configuration = hadoopConfiguration): RDD[(K, V)]
In fact, mapred.InputFormat.getSplits takes a hint for the desired number of splits, but mapreduce.InputFormat.getSplits takes only a JobContext. How can I influence the number of splits through the new API?
I have tried setting mapreduce.input.fileinputformat.split.maxsize and fs.s3n.block.size on the Configuration object, but they had no effect. I am trying to load a 4.5 GB file from s3n, and it gets loaded in a single task.
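For reference, a minimal sketch of that attempt (the s3n path, the 64 MB value, and the variable names are placeholders, not code from the actual job):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Copy the SparkContext's Hadoop configuration and override the split-related settings.
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.set("mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)
hadoopConf.set("fs.s3n.block.size", (64 * 1024 * 1024).toString)

// Pass the customized configuration to the new-API loader.
val rdd = sc.newAPIHadoopFile(
  "s3n://some-bucket/big-file",   // placeholder path
  classOf[TextInputFormat],       // org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)
println(rdd.partitions.length)    // still 1 in the scenario described above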
https://issues.apache.org/jira/browse/HADOOP-5861 is relevant, but it suggests that I should already see more than one split, since the default block size is 64 MB.
By default, when parallelizing an in-memory collection, Spark/PySpark creates as many partitions as there are CPU cores available to the application. The data of each partition resides on a single machine, Spark creates one task per partition, and shuffle operations move data from one partition to others.
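As a quick way to see that relationship from a shell (rdd here is just a placeholder for an already-loaded RDD):

// One task is launched per partition when an action runs.
println(rdd.partitions.length)

// A shuffle operation such as repartition redistributes the data across partitions.
val reshuffled = rdd.repartition(16)   // 16 is an arbitrary example
println(reshuffled.partitions.length)  // 16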
For file-based RDDs, Spark by default creates one partition for each block of the file (blocks being 128 MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger minPartitions value.
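With the old API shown at the top of the question, that larger value is the minPartitions hint; a sketch (the path and the value 64 are arbitrary placeholders):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Old-API load with an explicit minimum-partitions hint.
val oldApiRdd = sc.hadoopFile(
  "s3n://some-bucket/big-file",   // placeholder path
  classOf[TextInputFormat],       // org.apache.hadoop.mapred.TextInputFormat (old API)
  classOf[LongWritable],
  classOf[Text],
  64)                             // minPartitions hint
println(oldApiRdd.partitions.length)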
The function newAPIHadoopFile allows you to pass in a Configuration object, and on that you can set mapred.max.split.size. Even though this property is in the mapred namespace, since there is seemingly no new option, I would imagine the new API will still respect the variable.