I am wondering how sc.textFile is used in Spark. My guess is that the driver reads a portion of the file at a time and distributes the read text to the workers to process. Or do the workers read the text directly from the file themselves, without the driver's involvement?
The driver only looks at the file metadata: it checks that the path exists, checks which files are in the directory if the path is a directory, and checks their sizes. It then sends tasks to the workers, which do the actual reading of the file contents. The communication is essentially "you read this file, starting at this offset, for this length."
HDFS splits large files into blocks, and Spark will (usually) align its partition splits with those block boundaries, so seeking to the assigned offset is efficient and each task mostly reads data local to it.
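A minimal sketch of what that looks like from the spark-shell; the path and block size here are assumptions for illustration. A ~1 GB file on HDFS with a 128 MB block size would typically come back as roughly 8 partitions, one task per block:

```scala
// Assumed path; run from spark-shell where `sc` is already defined.
val rdd = sc.textFile("hdfs:///data/logs/big.log")
println(rdd.getNumPartitions)   // roughly fileSize / blockSize

// You can request more (smaller) splits via the minPartitions hint,
// but each split is still described to a worker as (file, offset, length).
val finer = sc.textFile("hdfs:///data/logs/big.log", 32)
println(finer.getNumPartitions) // at least 32
```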
Other filesystems tend to operate similarly, though not always. Compression can also interfere with this process: if the codec is not splittable, the whole file has to be read by a single task.
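For example (file names assumed), gzip is not splittable, so a gzipped file ends up as a single partition read by one task, while bzip2 is splittable and can still be divided along block boundaries:

```scala
// Non-splittable codec: one task must decompress the whole stream.
val gz = sc.textFile("hdfs:///data/logs/big.log.gz")
println(gz.getNumPartitions)   // 1

// Splittable codec: the file can still be split across tasks.
val bz2 = sc.textFile("hdfs:///data/logs/big.log.bz2")
println(bz2.getNumPartitions)  // can be > 1
```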