Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is it the driver or the workers who reads the text file when sc.textfile is used?

I am wondering how is sc.textfile used in Spark. My guess is that the driver reads a portion of the file at a time, and distributes the read text to the workers, to process. Or is it that the workers read the text directly themselves from the file without the involvement of the driver?

like image 353
pythonic Avatar asked Jun 07 '17 22:06

pythonic


1 Answers

The driver looks at file metadata - check that it exists, check what files are in the directory if it's a directory, and check their sizes. It then sends tasks to workers, who do the actual reading of the file contents. The communication is essentially "you read this file, starting at this offset, for this length."

HDFS splits large files into blocks, and spark will (usually/often) split tasks according to blocks, so the process of skipping to that offset will be efficient.

Other filesystems tend to operate similarly, though not always. Compression can also mess with this process if the codec is not splittable.

like image 141
Joe K Avatar answered Sep 23 '22 09:09

Joe K