
How does the number of partitions affect `wholeTextFiles` and `textFile`?

In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure which to use when. Here is what I know so far:

  • When dealing with files that are not split by line, one should use wholeTextFiles; otherwise, use textFile.

I would think that by default, wholeTextFiles and textFile partition by file content and by lines, respectively. But both of them allow you to change the minPartitions parameter.

So, how does changing the partitions affect how these are processed?

For example, say I have one very large file with 100 lines. What would be the difference between processing it with wholeTextFiles using 100 partitions, and processing it with textFile (which partitions it line by line) using the default of 100 partitions?

What is the difference between these?

makansij asked Nov 25 '15 00:11


1 Answer

For reference, wholeTextFiles uses WholeTextFileInputFormat, which extends CombineFileInputFormat.

A couple of notes on wholeTextFiles:

  • Each record in the RDD returned by wholeTextFiles holds the file name and the entire contents of that file. This means that a file cannot be split at all, neither across records nor across partitions.
  • Because the input format extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition.
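To make the record shape concrete, here is a rough plain-Python analogue that does not involve Spark at all. The helper names `whole_text_files_records` and `text_file_records` are hypothetical and exist only for this illustration. It only mimics the shape of the records (one per file versus one per line), and says nothing about how Spark distributes them.

```python
import os
import tempfile

def whole_text_files_records(directory):
    """One (path, content) record per file, mimicking the shape of wholeTextFiles."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path, encoding="utf-8") as f:
            records.append((path, f.read()))
    return records

def text_file_records(directory):
    """One record per line, mimicking the shape of textFile."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name), encoding="utf-8") as f:
            records.extend(f.read().splitlines())
    return records

def demo():
    # Build a throwaway directory with two small files.
    with tempfile.TemporaryDirectory() as d:
        for name, body in [("a.txt", "line 1\nline 2\n"), ("b.txt", "line 3\n")]:
            with open(os.path.join(d, name), "w", encoding="utf-8") as f:
                f.write(body)
        return whole_text_files_records(d), text_file_records(d)

whole, lines = demo()
print(len(whole))  # 2 records, one per file
print(lines)       # ['line 1', 'line 2', 'line 3']
```

Since a whole file is a single record, no partition count can cut that record into pieces.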

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set minPartitions=2, then I will likely get two partitions back instead.

Now, if I set minPartitions=3, I will still get back two partitions, because the contract for wholeTextFiles is that each record in the RDD contains an entire file.
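The two-file scenario can be sketched with a toy model in plain Python. This is a simplification, not Spark's or Hadoop's actual code: it assumes the maximum split size is the total input size divided by minPartitions (rounded up), and that whole files are packed greedily into a partition until that size is reached. The function name `combine_whole_files` and the file sizes are made up for illustration, and node/rack locality is ignored.

```python
import math

def combine_whole_files(file_sizes, min_partitions):
    """Toy model: pack unsplittable files into partitions.

    Assumption: max split size = ceil(total / min_partitions), and a
    partition is closed once its accumulated size reaches that limit.
    Returns a list of partitions, each a list of file indices.
    """
    total = sum(file_sizes)
    max_split = math.ceil(total / max(min_partitions, 1))
    partitions, current, current_size = [], [], 0
    for index, size in enumerate(file_sizes):
        # A file is never split: it is added to the current partition whole.
        current.append(index)
        current_size += size
        if current_size >= max_split:
            partitions.append(current)
            current, current_size = [], 0
    if current:
        partitions.append(current)
    return partitions

sizes = [10, 10]  # two small files, sizes in arbitrary units
for n in (1, 2, 3):
    print(n, combine_whole_files(sizes, n))
```

Under these assumptions, asking for one partition puts both files together, while asking for two or three both yield two partitions, because the number of partitions can never exceed the number of files.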

Mike Park answered Nov 14 '22 21:11