Typically Spark datasets are saved in multiple parts, allowing each worker to read a different file. Is there a similar solution when working on a single file? S3 provides the Select API, which should allow this kind of behaviour.
Spark appears to support this API (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html), but it seems to be used only for optimising queries, not for parallelising reads.
Edit: this is now out of date and depends on the type of CSV. Some CSVs allow newlines within column values; these are unsplittable. CSVs that do not, and that guarantee a newline always starts a new row, can be split. In that case Spark starts reading at a byte offset in the file, reads ahead to the next newline, and then continues reading from the start of that new row.
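You can see the difference in Spark's CSV reader via the `multiLine` option. Here is a minimal spark-shell sketch (the S3 path is hypothetical): with the default `multiLine=false` Spark treats the file as splittable, while `multiLine=true` (needed for quoted newlines inside columns) forces a single-task read.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-split-demo").getOrCreate()

// Default (multiLine = false): every newline is assumed to start a new
// row, so the file is splittable and a large file yields many partitions.
val splittable = spark.read
  .option("header", "true")
  .csv("s3://my-bucket/data.csv") // hypothetical path
println(splittable.rdd.getNumPartitions) // > 1 for a large enough file

// multiLine = true permits quoted newlines inside a column value; the
// file can no longer be split blindly, so Spark reads it as one task.
val unsplittable = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("s3://my-bucket/data.csv")
println(unsplittable.rdd.getNumPartitions) // 1
```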
FYI, CSVs are inherently single-threaded: there is no extra information in a CSV file that tells the reader where any given row starts without reading the whole file from the beginning.
If you want multiple readers on the same file, use a format like Parquet, which has row groups whose start positions are explicitly recorded in the file footer and can therefore be read by independent readers. When Spark reads a Parquet file, it splits the row groups into separate tasks. Ultimately, having appropriately sized files is very important for Spark performance.
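As a rough sketch of that (paths are hypothetical): row-group size is controlled by the standard Parquet-Hadoop setting `parquet.block.size`, and on read Spark splits even a single Parquet file into multiple tasks along those row groups.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-rowgroup-demo").getOrCreate()

// Target ~64 MB row groups, so a single output file still contains
// many independently readable row groups.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

// Write one large Parquet file (hypothetical path).
spark.range(0L, 100000000L)
  .coalesce(1)
  .write
  .parquet("s3://my-bucket/big.parquet")

// On read, Spark splits the single file into tasks (bounded by
// spark.sql.files.maxPartitionBytes, 128 MB by default); each task
// reads the row groups that fall inside its byte range.
val back = spark.read.parquet("s3://my-bucket/big.parquet")
println(back.rdd.getNumPartitions) // > 1 even though there is one file
```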