If a file is loaded from HDFS, Spark by default creates one partition per block. But how does Spark decide partitions when a file is loaded from an S3 bucket?
Even when reading a file from an S3 bucket, Spark (by default) creates one partition per block, i.e. total number of partitions = total file size / block size.
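As a rough illustration (not the exact internals), you can observe this with the RDD API; the bucket and object names below are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-partition-count").getOrCreate()

// With the default fs.s3a.block.size of 32M, a ~320 MB object is read
// as roughly 320M / 32M = 10 partitions.
val rdd = spark.sparkContext.textFile("s3a://my-bucket/data/large-file.csv")
println(s"partitions = ${rdd.getNumPartitions}")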
The block size for S3 is available as a property in Hadoop's core-site.xml file, which Spark uses:
<property>
  <name>fs.s3a.block.size</name>
  <value>32M</value>
  <description>Block size to use when reading files using s3a: file system.
  </description>
</property>
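If editing core-site.xml is not convenient, the same property can also be passed through the Spark configuration using the spark.hadoop. prefix, which forwards it to the Hadoop configuration. A minimal sketch (the 64M value is just an example):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-block-size")
  // "spark.hadoop." prefix forwards the property into the Hadoop configuration;
  // a larger block size means fewer, bigger partitions.
  .config("spark.hadoop.fs.s3a.block.size", "64M")
  .getOrCreate()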
Unlike HDFS, AWS S3 is not a file system. It is an object store. The S3A connector makes S3 look like a file system.
Please check out the documentation for more details.
See the code of org.apache.hadoop.mapred.FileInputFormat.getSplits().

Block size depends on the S3 file system implementation (see FileStatus.getBlockSize()). E.g. S3AFileStatus just sets it to 0, and then FileInputFormat.computeSplitSize() comes into play.
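For reference, the split size that computeSplitSize() returns is max(minSize, min(goalSize, blockSize)), where goalSize is the total size divided by the requested number of splits. A small sketch mirroring that logic:

// Mirrors the logic of FileInputFormat.computeSplitSize():
// splitSize = max(minSize, min(goalSize, blockSize))
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// If the file status reports a block size of 0, min(goalSize, 0) is 0 and the
// result collapses to minSize, so minSize and goalSize end up driving the splits.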
Also, you don't get splits if your InputFormat is not splittable :)
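For example, a gzip-compressed text file is not splittable, so it ends up in a single partition regardless of its size. A sketch, with a hypothetical path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("gzip-not-splittable").getOrCreate()
// TextInputFormat marks .gz files as non-splittable, so the whole object
// becomes one split, i.e. one partition, regardless of its size.
val gz = spark.sparkContext.textFile("s3a://my-bucket/data/large-file.csv.gz")
println(gz.getNumPartitions)  // typically 1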