Typically Spark datasets are saved in multiple parts, allowing each worker to read a different file. Is there a similar solution when working on a single file? S3 provides the Select API, which should allow this kind of behaviour.
Spark appears to support this API (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html), but it seems to be used only for optimising queries, not for parallelising reads.
Edit: this is now out of date and depends on the type of CSV. Some CSVs allow newlines within column values; these are unsplittable. CSVs that do not, and that guarantee a newline always starts a new row, can be split. In that case Spark starts reading at a byte offset in the file, reads ahead to the next newline, and then continues reading from the start of that new row.
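You can see the difference in Spark's CSV reader via the `multiLine` option. Here is a minimal spark-shell sketch (the S3 path is hypothetical): with the default `multiLine=false` Spark treats the file as splittable, while `multiLine=true` (needed for quoted newlines inside columns) forces a single-task read.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-split-demo").getOrCreate()

// Default (multiLine = false): every newline is assumed to start a new
// row, so the file is splittable and a large file yields many partitions.
val splittable = spark.read
  .option("header", "true")
  .csv("s3://my-bucket/data.csv") // hypothetical path
println(splittable.rdd.getNumPartitions) // > 1 for a large enough file

// multiLine = true permits quoted newlines inside a column value; the
// file can no longer be split blindly, so Spark reads it as one task.
val unsplittable = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("s3://my-bucket/data.csv")
println(unsplittable.rdd.getNumPartitions) // 1
```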
FYI, CSVs are inherently single-threaded: there is no extra information in a CSV file that tells the reader where any given row starts without reading the whole file from the beginning.
If you want multiple readers on the same file, use a format like Parquet, which has row groups whose start positions are explicitly recorded in the file footer and can therefore be read by independent readers. When Spark reads a Parquet file, it splits the row groups into separate tasks. Ultimately, having appropriately sized files is very important for Spark performance.
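As a rough sketch of that (paths are hypothetical): row-group size is controlled by the standard Parquet-Hadoop setting `parquet.block.size`, and on read Spark splits even a single Parquet file into multiple tasks along those row groups.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-rowgroup-demo").getOrCreate()

// Target ~64 MB row groups, so a single output file still contains
// many independently readable row groups.
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)

// Write one large Parquet file (hypothetical path).
spark.range(0L, 100000000L)
  .coalesce(1)
  .write
  .parquet("s3://my-bucket/big.parquet")

// On read, Spark splits the single file into tasks (bounded by
// spark.sql.files.maxPartitionBytes, 128 MB by default); each task
// reads the row groups that fall inside its byte range.
val back = spark.read.parquet("s3://my-bucket/big.parquet")
println(back.rdd.getNumPartitions) // > 1 even though there is one file
```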