I understand the basic theory of textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file and the value is the content of each file.
Now, from a technical point of view, what's the difference between:
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv", 8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
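For illustration, here is a minimal sketch (the path is the hypothetical one from the question) of the element types each call produces:
val lines: org.apache.spark.rdd.RDD[String] =
  sc.textFile("my/path/*.csv")            // one element per line; the file of origin is not kept
val files: org.apache.spark.rdd.RDD[(String, String)] =
  sc.wholeTextFiles("my/path/*.csv")      // one element per file: (filePath, fullContent)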
When reading uncompressed files with textFile, it will split the data into chunks of 32 MB. This is advantageous from a memory perspective. This also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.
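If you need the per-file keys but still want to distribute the downstream work, one possible workaround (a sketch using the question's hypothetical path and partition count, not part of the original answer) is to explode the file contents into lines and repartition:
val perLine = sc.wholeTextFiles("my/path/*.csv")
  .flatMap { case (path, content) => content.split("\n").map(line => (path, line)) } // keep the file path with every line
  .repartition(8)                                                                    // spread the records across the cluster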
textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values
That's not accurate:
textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (this depends on the number of partitions requested, Spark's default number of partitions, and the underlying file system). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of the files, and the order of records within the files (the order is not "lost").
wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (it might cause bad performance or an OutOfMemoryError, since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration, with multiple files potentially loaded into a single partition.
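As a concrete sketch of working with that (fileName, fileContent) shape (the path is again the hypothetical one from the question), here is how you could count the lines in each input file while keeping the file path as the key:
val linesPerFile = sc.wholeTextFiles("my/path/*.csv")  // RDD[(filePath, fileContent)]
  .mapValues(content => content.split("\n").length)    // (filePath, numberOfLines)
linesPerFile.collect().foreach { case (path, n) => println(s"$path: $n lines") }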
Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all the files are small enough.
As of Spark 2.1.1, the following is the code for textFile:
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
This internally uses hadoopFile to read local files, HDFS files, or S3, using URI schemes like file://, hdfs://, and s3a://.
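To make that concrete, here is a sketch (the path and partition count are hypothetical) of the equivalent direct hadoopFile call, which yields (byte offset, line) pairs that textFile then maps down to plain strings:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile("hdfs:///my/path/*.csv",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 8)
val lines = raw.map { case (_, text) => text.toString }  // same elements as sc.textFile would return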
Whereas for wholeTextFiles the signature is as below:
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope
If we observe, the signatures of both methods are similar, but textFile is useful for reading files, whereas wholeTextFiles is meant for reading directories of small files. It can also be used with larger files, but performance may suffer.
So when you want to deal with large files, textFile is the better option, whereas if you want to deal with a directory of smaller files, wholeTextFiles is better.
textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") will create an RDD in which each individual line is an element.
wholeTextFiles() reads a directory of text files and returns a PairRDD. For example, if there are a few files in a directory, the wholeTextFiles() method will create a pair RDD with the file path as the key and the whole file content as the string value.