How does the number of partitions affect `wholeTextFiles` and `textFile`?

Question

In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure when to use which. Here is what I know so far:

When dealing with files that are not split by line, one should use wholeTextFiles; otherwise, use textFile.

I would think that, by default, wholeTextFiles partitions by file and textFile partitions by line. But both of them let you change the minPartitions parameter.

So, how does changing the partitions affect how these are processed?

For example, say I have one very large file with 100 lines. What would be the difference between processing it with wholeTextFiles and 100 partitions, and processing it with textFile (which partitions it line by line) and 100 partitions?

What is the difference between these?

Mike Park · Accepted Answer

For reference, wholeTextFiles uses WholeTextFileInputFormat which extends CombineFileInputFormat.

A couple of notes on wholeTextFiles.

Each record in the RDD returned by wholeTextFiles has the file name and the entire contents of the file. This means that a file cannot be split (at all).
Because it extends CombineFileInputFormat, it will try to combine groups of smaller files into one partition.
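To illustrate the first note, here is a minimal sketch of working with one such record. The helper name summarize_file and the example path are hypothetical; the only thing taken from the notes above is that each record is a (file name, contents) pair, so any line-level work has to split the contents explicitly.

```python
import os


def summarize_file(record):
    """Turn one wholeTextFiles record into (file name, number of lines).

    `record` is a (path, content) pair. The content is the whole file as a
    single string, so lines have to be split out by hand.
    """
    path, content = record
    return os.path.basename(path), len(content.splitlines())


# Hypothetical usage with an existing SparkContext `sc`:
#   sc.wholeTextFiles("some/directory").map(summarize_file).collect()
# The same function also works on a plain tuple, with no Spark involved:
print(summarize_file(("file:/data/a.txt", "first\nsecond\nthird\n")))  # ('a.txt', 3)
```

With textFile the splitting would already be done, since each line arrives as its own record.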

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set minPartitions=2, then I will likely get two partitions back instead.

Now, if I were to set minPartitions=3, I would still get back two partitions, because the contract for wholeTextFiles is that each record in the RDD contains an entire file.
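One way to check this on your own setup is a small experiment like the sketch below. The function names (write_small_files, partition_report, demo), the file names, and the local[2] master are all assumptions made for illustration; only wholeTextFiles, minPartitions, getNumPartitions and count are actual PySpark calls. It writes two tiny files and reports how many partitions and records come back for several minPartitions values.

```python
import os
import tempfile


def write_small_files(directory, count):
    """Create `count` tiny text files in `directory` and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"part{i}.txt")
        with open(path, "w") as handle:
            handle.write(f"line one of file {i}\nline two of file {i}\n")
        paths.append(path)
    return paths


def partition_report(sc, directory, min_partitions_values):
    """Map each requested minPartitions to (partition count, record count)."""
    report = {}
    for requested in min_partitions_values:
        rdd = sc.wholeTextFiles(directory, minPartitions=requested)
        report[requested] = (rdd.getNumPartitions(), rdd.count())
    return report


def demo():
    # Imported here so the helpers above can be loaded without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "wholeTextFiles-partitions")
    try:
        with tempfile.TemporaryDirectory() as directory:
            write_small_files(directory, 2)
            report = partition_report(sc, directory, [1, 2, 3])
            for requested, (partitions, records) in report.items():
                print(f"minPartitions={requested}: "
                      f"{partitions} partitions, {records} records")
    finally:
        sc.stop()
```

Calling demo() on a machine with PySpark installed prints one line per requested value. If the description above holds, the record count stays at two (one per file) and the partition count never exceeds the number of files, however large minPartitions is.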

Tags: python, apache-spark, pyspark

makansij
