
Spark textFile vs wholeTextFiles

I understand the basic theory of textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values, where the key is the path of each file, the value is the content of each file.

Now, from a technical point of view, what's the difference between:

val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions

and

val textFile = sc.wholeTextFiles("my/path/*.csv",8)
textFile.getNumPartitions

In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?

Dan asked Nov 06 '17 04:11



4 Answers

The main difference, as you mentioned, is that textFile will return an RDD with each line as an element while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.

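When reading uncompressed files, textFile splits the data into chunks of about 32MB (the exact size depends on the file system's block size and on the requested number of partitions). This is advantageous from a memory perspective. It also means that the lines of a single file can end up in different partitions, so their order is not guaranteed once the partitions are processed in parallel; if the order must be preserved, wholeTextFiles should be used.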
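To make the chunk size concrete, here is a rough sketch of the split arithmetic, written as plain Scala with no Spark dependency. It assumes textFile goes through Hadoop's older FileInputFormat, where the split size is the smaller of the block size and the total input size divided by the requested number of splits, and a final remainder of less than about 10% of a split is merged into the previous split. The names SplitSizeSketch, splitSize and splitsForFile are made up for illustration, and 32MB is assumed to be the block size reported for the local file system; HDFS usually reports larger blocks.

```scala
object SplitSizeSketch {
  // Assumed slack factor: the last split may be up to 10% larger than the others.
  val SplitSlop = 1.1

  // Split size = max(minSize, min(totalSize / requestedSplits, blockSize)).
  def splitSize(totalSize: Long, requestedSplits: Int, blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = totalSize / math.max(requestedSplits, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // Number of splits (and therefore partitions) one splittable file would produce.
  def splitsForFile(fileLength: Long, size: Long): Int = {
    var remaining = fileLength
    var count = 0
    while (remaining.toDouble / size > SplitSlop) {
      remaining -= size
      count += 1
    }
    if (remaining > 0) count += 1 // the tail becomes the last split
    count
  }
}
```

Under these assumptions, a single 100MB file read with a 32MB block size and 2 requested partitions is cut into 32MB splits, while requesting 8 partitions lowers the split size to 12.5MB. This is why the second argument of textFile acts as a minimum rather than an exact partition count.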

wholeTextFiles reads the complete content of a file at once; the content won't be partially spilled to disk or partially garbage collected. Each file is handled by one core, and the data for each file lives on a single machine, which makes it harder to distribute the load.

Shaido answered Oct 17 '22 02:10


"textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values"

That's not accurate:

  1. textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (depends on the number of partitions requested, Spark's default number of partitions, and the underlying File System). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of files, and the order of records within the files (order is not "lost").

  2. wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (might cause bad performance or OutOfMemoryError since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration - with multiple files potentially loaded into a single partition.

Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.
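As a rough illustration of the two record shapes, the sketch below reads the same path pattern both ways and then turns the (fileName, fileContent) pairs into (fileName, line) pairs, which is one way to keep the link between a line and its file. It assumes Spark is on the classpath and runs in local mode; the object name TextFileVsWholeTextFiles and the helper linesWithSource are hypothetical, and the path is the one from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TextFileVsWholeTextFiles {
  // Hypothetical helper: expand one (path, content) record into (path, line) records.
  def linesWithSource(record: (String, String)): List[(String, String)] = {
    val (path, content) = record
    content.split("\\r?\\n").toList.map(line => (path, line))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("textFile-vs-wholeTextFiles").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val pattern = "my/path/*.csv" // path pattern taken from the question

      // One record per line; the originating file is not part of the record.
      val lines: RDD[String] = sc.textFile(pattern, 8)

      // One record per file: (file path, whole file content).
      val files: RDD[(String, String)] = sc.wholeTextFiles(pattern, 8)

      // One record per line again, but each line now carries its file path.
      val tagged: RDD[(String, String)] = files.flatMap(linesWithSource)

      println(s"textFile records: ${lines.count()}, partitions: ${lines.getNumPartitions}")
      println(s"wholeTextFiles records: ${files.count()}, partitions: ${files.getNumPartitions}")
      println(s"tagged line records: ${tagged.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

The second argument is a minimum-partitions hint in both calls, so the printed partition counts may differ from 8 depending on the number and size of the input files.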

Tzach Zohar answered Oct 17 '22 02:10


As of Spark 2.1.1, the code for textFile is as follows:

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

It internally uses hadoopFile, so it can read local files, HDFS files, and S3 objects through URI schemes such as file://, hdfs://, and s3a://.

The signature of wholeTextFiles, by comparison, is as follows:

def wholeTextFiles(
  path: String,
  minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope 

Both methods take the same parameters, but textFile is meant for reading files line by line, whereas wholeTextFiles is meant for reading directories of small files. wholeTextFiles can also read larger files, but performance may suffer.
So textFile is the better option for large files, whereas wholeTextFiles is the better option for a directory of small files.

Sainagaraju Vaduka answered Oct 17 '22 02:10


  1. textFile() reads a text file and returns an RDD of Strings. For example, sc.textFile("/mydata.txt") creates an RDD in which each individual line is an element.

  2. wholeTextFiles() reads a directory of text files and returns a pair RDD. For example, if there are a few files in a directory, wholeTextFiles() creates a pair RDD with the file path as the key and the whole file content, as a single string, as the value.

KayV answered Oct 17 '22 02:10