In Spark, I understand how to use wholeTextFiles and textFile, but I'm not sure which to use when. Here is what I know so far: when each file should be read as a single record, use wholeTextFiles, otherwise use textFile. I would think that by default, wholeTextFiles and textFile partition by file content and by lines, respectively. But both of them allow you to change the parameter minPartitions.
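To make the distinction concrete, here is a minimal spark-shell sketch (the paths are hypothetical) of what each API actually returns:

```scala
// In spark-shell, sc (a SparkContext) is predefined.

// textFile: RDD[String], one element per line across all matched files.
val lines = sc.textFile("/data/big.txt")

// wholeTextFiles: RDD[(String, String)], one element per file,
// keyed by path with the full file contents as the value.
val files = sc.wholeTextFiles("/data/dir")

lines.take(3).foreach(println)                  // first three lines
files.collect().foreach { case (path, body) =>  // one pair per file
  println(s"$path -> ${body.length} chars")
}
```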
So, how does changing the partitions affect how these are processed?
For example, say I have one very large file with 100 lines. What would be the difference between processing it as wholeTextFiles with 100 partitions and processing it as textFile (which partitions it line by line) also requesting 100 partitions?
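One way to probe this is to read the same file both ways and compare the resulting partition counts; a sketch in the same spark-shell session, assuming a hypothetical 100-line file at /data/100lines.txt:

```scala
val asLines = sc.textFile("/data/100lines.txt", minPartitions = 100)
val asWhole = sc.wholeTextFiles("/data/100lines.txt", minPartitions = 100)

// textFile splits a single file by byte ranges, so the count can
// approach the requested 100 (depending on the file's size in bytes).
println(asLines.getNumPartitions)

// wholeTextFiles must keep each file as one record, so a single input
// file cannot be split further no matter how large minPartitions is.
println(asWhole.getNumPartitions)
```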
For reference, wholeTextFiles uses WholeTextFileInputFormat, which extends CombineFileInputFormat.
A couple of notes on wholeTextFiles:

Each record in the RDD returned by wholeTextFiles contains the file name and the entire contents of the file. This means that a file cannot be split (at all). Because it uses CombineFileInputFormat, it will try to combine groups of smaller files into one partition.

If I have two small files in a directory, it is possible that both files will end up in a single partition. If I set minPartitions=2, then I will likely get two partitions back instead. Now if I were to set minPartitions=3, I will still get back two partitions, because the contract for wholeTextFiles is that each record in the RDD contains an entire file.
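That contract is easy to check by inspecting partition counts; a sketch along the same lines, assuming a hypothetical directory /data/small that contains exactly two small files:

```scala
// With the default minPartitions, CombineFileInputFormat may pack
// both small files into a single partition.
val combined = sc.wholeTextFiles("/data/small")
println(combined.getNumPartitions)  // likely 1

// Requesting two partitions tends to give one file per partition.
val two = sc.wholeTextFiles("/data/small", minPartitions = 2)
println(two.getNumPartitions)       // likely 2

// Requesting three changes nothing: a record is an entire file, so
// two files can never produce more than two splits.
val three = sc.wholeTextFiles("/data/small", minPartitions = 3)
println(three.getNumPartitions)     // still 2
```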