I wanted to ask: is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark? They both work on HDFS (TextInputFormat), so in theory it should be the same.
Are there any cases where the procedure of data partitioning can differ? Any insights would be very helpful to my study.
Thanks
Is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark?
Spark supports all Hadoop I/O formats, since it uses the same Hadoop InputFormat APIs along with its own formatters. So, by default, Spark input partitions work the same way as Hadoop/MapReduce input splits. The data size in a partition is configurable at run time, and transformations like repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed.
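To make this concrete, here is a minimal Scala sketch of those partition controls; the HDFS path, master setting, and partition counts are illustrative assumptions, not from the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionControl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-control").setMaster("local[4]"))

    // textFile uses Hadoop's TextInputFormat under the hood, so the initial
    // partition count follows the HDFS input splits (roughly one per block).
    val lines = sc.textFile("hdfs:///data/input.txt") // placeholder path
    println(s"initial partitions: ${lines.getNumPartitions}")

    // repartition triggers a full shuffle; it can increase or decrease partitions.
    val wider = lines.repartition(16)

    // coalesce avoids a shuffle when reducing the partition count.
    val narrower = wider.coalesce(4)
    println(s"after coalesce: ${narrower.getNumPartitions}")

    sc.stop()
  }
}
```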
Are there any cases where the procedure of data partitioning can differ?
Apart from the Hadoop I/O APIs, Spark has some other intelligent I/O formats (e.g., Databricks CSV and NoSQL DB connectors) that directly return a Dataset/DataFrame (higher-level abstractions on top of RDD), which are Spark-specific.
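As a rough illustration of that DataFrame reading path, the sketch below uses the built-in spark.read.csv reader (Spark 2.x; the original Databricks spark-csv package used .format("com.databricks.spark.csv") instead). The path is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

object DataFrameRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("df-read")
      .master("local[*]")
      .getOrCreate()

    // Returns a DataFrame directly; the data source itself decides
    // how the input is partitioned, not Hadoop input splits.
    val df = spark.read
      .option("header", "true")
      .csv("hdfs:///data/input.csv") // placeholder path

    println(s"partitions: ${df.rdd.getNumPartitions}")
    spark.stop()
  }
}
```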
Key points on Spark partitions when reading data from non-Hadoop sources (a configuration sketch follows the list):

- S3: fs.s3n.block.size or fs.s3.block.size
- Cassandra: spark.cassandra.input.split.size_in_mb
- MongoDB: spark.mongodb.input.partitionerOptions.partitionSizeMB
- The resulting number of partitions is roughly max(sc.defaultParallelism, total_data_size / data_block_size)
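Here is a hedged sketch showing where each of those knobs is set. The Cassandra and MongoDB properties require the respective connectors on the classpath, and all sizes below are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NonHdfsPartitionKnobs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("partition-knobs")
      .setMaster("local[*]")
      // Cassandra connector: target input split size per partition, in MB.
      .set("spark.cassandra.input.split.size_in_mb", "64")
      // MongoDB connector: partitioner's target partition size, in MB.
      .set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64")

    val sc = new SparkContext(conf)

    // S3 block size is a Hadoop-level setting, so it goes on hadoopConfiguration.
    sc.hadoopConfiguration.set("fs.s3n.block.size", (64 * 1024 * 1024).toString)

    // Rough partition-count estimate from the list above (bytes, illustrative).
    val totalDataSize = 10L * 1024 * 1024 * 1024 // assume 10 GB of input
    val dataBlockSize = 64L * 1024 * 1024        // assume 64 MB blocks
    val estimatedPartitions =
      math.max(sc.defaultParallelism.toLong, totalDataSize / dataBlockSize)
    println(s"estimated partitions: $estimatedPartitions")

    sc.stop()
  }
}
```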
Sometimes the number of available cores in the cluster also influences the number of partitions, e.g. sc.parallelize() without a partitions parameter, as sketched below. Read more: link1
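A small sketch of that behavior, assuming a local[4] master so sc.defaultParallelism is 4:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DefaultParallelism {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("default-par").setMaster("local[4]"))

    val auto  = sc.parallelize(1 to 1000)    // no numSlices: uses defaultParallelism
    val fixed = sc.parallelize(1 to 1000, 8) // explicit partition count

    println(s"defaultParallelism = ${sc.defaultParallelism}") // 4 with local[4]
    println(s"auto  = ${auto.getNumPartitions}")              // 4
    println(s"fixed = ${fixed.getNumPartitions}")             // 8

    sc.stop()
  }
}
```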