When I use Spark to read multiple files from S3 (e.g. a directory with many Parquet files):
Does the logical partitioning happen at the beginning, with each executor then downloading its data directly (on the worker node)?
Or does the driver download the data (partially or fully) and only then partition it and send it to the executors?
Also, will the partitioning default to the same partitions that were used for the write (i.e. each file = 1 partition)?
Data on S3 is, of course, external to HDFS.
You can read from S3 by providing a path or several paths, or through the Hive Metastore, provided you have registered the data there by creating DDL for an external S3 table and loading its partitions with MSCK REPAIR TABLE, or with ALTER TABLE table_name RECOVER PARTITIONS for Hive on EMR.
If you use:
val df = spark.read.parquet("/path/to/parquet/file.../...")
then there is no guarantee on partitioning; it depends on various settings such as spark.sql.files.maxPartitionBytes (see "Does Spark maintain parquet partitioning on read?"), bearing in mind that the APIs evolve and improve.
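To illustrate why "each file = 1 partition" does not hold in general, here is a rough pure-Scala model of how Spark plans file splits for file-based sources. It is a sketch under assumptions, not Spark's actual code: it uses the documented defaults for spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB), treats every file as splittable, and takes the available parallelism as a plain parameter. The object and method names are made up for illustration.

```scala
// Simplified model of file-split planning; names are hypothetical, not Spark API.
object SplitPlanningSketch {
  // Documented defaults of spark.sql.files.maxPartitionBytes / openCostInBytes
  val MaxPartitionBytes: Long = 128L * 1024 * 1024
  val OpenCostInBytes: Long = 4L * 1024 * 1024

  // Target split size: capped by maxPartitionBytes, floored by the open cost,
  // otherwise the padded total size spread evenly over the available cores.
  def maxSplitBytes(fileSizes: Seq[Long], parallelism: Int): Long = {
    val totalBytes = fileSizes.map(_ + OpenCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    math.min(MaxPartitionBytes, math.max(OpenCostInBytes, bytesPerCore))
  }

  // Cut files into chunks, then pack chunks (largest first) into partitions.
  def countPartitions(fileSizes: Seq[Long], parallelism: Int): Int = {
    val splitBytes = maxSplitBytes(fileSizes, parallelism)
    val chunks = fileSizes.flatMap { size =>
      (0L until size by splitBytes).map(offset => math.min(splitBytes, size - offset))
    }.sortBy(-_)

    var partitions = 0
    var currentSize = 0L
    var currentCount = 0
    chunks.foreach { chunk =>
      // Close the current partition if this chunk would overflow it
      if (currentCount > 0 && currentSize + chunk > splitBytes) {
        partitions += 1
        currentSize = 0L
        currentCount = 0
      }
      // Each chunk is padded with the assumed cost of opening a file
      currentSize += chunk + OpenCostInBytes
      currentCount += 1
    }
    if (currentCount > 0) partitions += 1
    partitions
  }
}
```

Under this model, with a parallelism of 8, a single 1 GiB file (`countPartitions(Seq(1024L * 1024 * 1024), 8)`) and one hundred 1 MiB files (`countPartitions(Seq.fill(100)(1024L * 1024), 8)`) both come out at 8 read partitions: the large file is split, and the small files are packed together. The real numbers depend on your Spark version, settings, and cluster.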
But, this:
val df = spark.read.parquet("/path/to/parquet/file.../.../partitioncolumn=*")
will distribute partitions over the executors in some manner that follows your saved partition structure, a bit like Spark bucketBy.
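Rather than relying on either expectation, you can check what you actually get. The following is a minimal sketch, assuming a local Spark session and a temporary local directory standing in for the S3 prefix (both are assumptions for illustration, as are the column names and the object name). It writes a small partitioned dataset, reads it back, and prints how many files and how many read partitions Spark reports.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object InspectReadPartitions {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on a cluster you would reuse your own
    val spark = SparkSession.builder()
      .appName("inspect-read-partitions")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical local stand-in for an S3 location
    val path = Files.createTempDirectory("parquet-demo").resolve("events").toString

    // Write a small dataset laid out as partitioncolumn=<value>/ directories
    (1 to 1000).map(i => (i, i % 4)).toDF("id", "partitioncolumn")
      .repartition(10)
      .write.partitionBy("partitioncolumn").parquet(path)

    // Read it back and compare the file count with the read partition count
    val df = spark.read.parquet(path)
    println(s"files read: ${df.inputFiles.length}")
    println(s"read partitions: ${df.rdd.getNumPartitions}")
    println(s"maxPartitionBytes: ${spark.conf.get("spark.sql.files.maxPartitionBytes")}")

    spark.stop()
  }
}
```

Comparing the first two printed numbers on your own data shows whether the read partitioning matches the written file layout for your settings.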
The driver only fetches the metadata when you specify the S3 path directly; the executors read the file contents themselves.
In your terms: