I am wondering how sc.textFile is used in Spark. My guess is that the driver reads a portion of the file at a time and distributes the read text to the workers to process. Or do the workers read the text directly from the file themselves, without the driver's involvement?
The driver only looks at the file metadata: it checks that the path exists, checks which files are in the directory if the path is a directory, and checks their sizes. It then sends tasks to the workers, which do the actual reading of the file contents. The communication is essentially "you read this file, starting at this offset, for this length."
HDFS splits large files into blocks, and Spark will (usually) align its partition splits with those block boundaries, so seeking to the assigned offset is efficient and each task mostly reads data local to it.
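A minimal sketch of what that looks like from the spark-shell; the path and block size here are assumptions for illustration. A ~1 GB file on HDFS with a 128 MB block size would typically come back as roughly 8 partitions, one task per block:

```scala
// Assumed path; run from spark-shell where `sc` is already defined.
val rdd = sc.textFile("hdfs:///data/logs/big.log")
println(rdd.getNumPartitions)   // roughly fileSize / blockSize

// You can request more (smaller) splits via the minPartitions hint,
// but each split is still described to a worker as (file, offset, length).
val finer = sc.textFile("hdfs:///data/logs/big.log", 32)
println(finer.getNumPartitions) // at least 32
```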
Other filesystems tend to operate similarly, though not always. Compression can also interfere with this process: if the codec is not splittable, the whole file has to be read by a single task.
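For example (file names assumed), gzip is not splittable, so a gzipped file ends up as a single partition read by one task, while bzip2 is splittable and can still be divided along block boundaries:

```scala
// Non-splittable codec: one task must decompress the whole stream.
val gz = sc.textFile("hdfs:///data/logs/big.log.gz")
println(gz.getNumPartitions)   // 1

// Splittable codec: the file can still be split across tasks.
val bz2 = sc.textFile("hdfs:///data/logs/big.log.bz2")
println(bz2.getNumPartitions)  // can be > 1
```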