I understand the basic theory of textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file and the value is the content of each file.
Now, from a technical point of view, what's the difference between:
val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions
and
val textFile = sc.wholeTextFiles("my/path/*.csv", 8)
textFile.getNumPartitions
In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?
The main difference, as you mentioned, is that textFile will return an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.
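For illustration, here is a minimal sketch (the path is the hypothetical one from the question) of the element types each call produces:
val lines: org.apache.spark.rdd.RDD[String] =
  sc.textFile("my/path/*.csv")            // one element per line; the file of origin is not kept
val files: org.apache.spark.rdd.RDD[(String, String)] =
  sc.wholeTextFiles("my/path/*.csv")      // one element per file: (filePath, fullContent)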
When reading uncompressed files with textFile, it will split the data into chunks of 32 MB. This is advantageous from a memory perspective. This also means that the ordering of the lines is lost; if the order should be preserved, then wholeTextFiles should be used.
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.
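If you need the per-file keys but still want to distribute the downstream work, one possible workaround (a sketch using the question's hypothetical path and partition count, not part of the original answer) is to explode the file contents into lines and repartition:
val perLine = sc.wholeTextFiles("my/path/*.csv")
  .flatMap { case (path, content) => content.split("\n").map(line => (path, line)) } // keep the file path with every line
  .repartition(8)                                                                    // spread the records across the cluster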
textFile generating a partition for each file, while wholeTextFiles generates an RDD of pair values
That's not accurate:
textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (this depends on the number of partitions requested, Spark's default number of partitions, and the underlying file system). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of the files, and the order of records within the files (the order is not "lost").
wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (it might cause bad performance or an OutOfMemoryError, since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration, with multiple files potentially loaded into a single partition.
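As a concrete sketch of working with that (fileName, fileContent) shape (the path is again the hypothetical one from the question), here is how you could count the lines in each input file while keeping the file path as the key:
val linesPerFile = sc.wholeTextFiles("my/path/*.csv")  // RDD[(filePath, fileContent)]
  .mapValues(content => content.split("\n").length)    // (filePath, numberOfLines)
linesPerFile.collect().foreach { case (path, n) => println(s"$path: $n lines") }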
Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all the files are small enough.
As of Spark 2.1.1, the following is the code for textFile:
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
This internally uses hadoopFile to read local files, HDFS files, or S3, using URI schemes like file://, hdfs://, and s3a://.
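To make that concrete, here is a sketch (the path and partition count are hypothetical) of the equivalent direct hadoopFile call, which yields (byte offset, line) pairs that textFile then maps down to plain strings:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val raw = sc.hadoopFile("hdfs:///my/path/*.csv",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 8)
val lines = raw.map { case (_, text) => text.toString }  // same elements as sc.textFile would return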
Whereas for wholeTextFiles the signature is as below:
def wholeTextFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope
If we observe, the signatures of both methods are similar, but textFile is useful for reading files, whereas wholeTextFiles is meant for reading directories of small files. It can also be used with larger files, but performance may suffer.
So when you want to deal with large files, textFile is the better option, whereas if you want to deal with a directory of smaller files, wholeTextFiles is better.
textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") will create an RDD in which each individual line is an element.
wholeTextFiles() reads a directory of text files and returns a PairRDD. For example, if there are a few files in a directory, the wholeTextFiles() method will create a pair RDD with the file path as the key and the whole file content as the string value.