I've come across the glom() method on RDD. As per the documentation:

Return an RDD created by coalescing all elements within each partition into an array.

Does glom() shuffle the data across the partitions, or does it only return the partition data as an array? In the latter case, I believe that the same can be achieved using mapPartitions().

I would also like to know if there are any use cases that benefit from glom().
Does glom shuffle the data across partitions?

No, it doesn't.
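A minimal sketch to illustrate, assuming a SparkContext named sc and a small example RDD (both hypothetical, not from the original post): glom() keeps the existing partitioning, so the partition count is unchanged and no shuffle is introduced.

val rdd = sc.parallelize(1 to 8, numSlices = 4)
val glommed = rdd.glom()                   // RDD[Array[Int]], one array per partition
assert(glommed.getNumPartitions == rdd.getNumPartitions)  // partitioning is preserved
glommed.collect().foreach(arr => println(arr.mkString("[", ", ", "]")))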
If this is the second case I believe that the same can be achieved using mapPartitions

It can:

rdd.mapPartitions(iter => Iterator(iter.toArray))

but the same thing applies to any non-shuffling transformation like map, flatMap or filter.
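For comparison, a sketch reusing the hypothetical rdd from above: both forms yield one array per partition with identical contents, since neither moves data between partitions.

val viaGlom = rdd.glom().collect()
val viaMapPartitions = rdd.mapPartitions(iter => Iterator(iter.toArray)).collect()
// Same partition contents in the same order
assert(viaGlom.map(_.toSeq).sameElements(viaMapPartitions.map(_.toSeq)))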
if there are any use cases which benefit from glom.
Any situation where you need to access partition data in a form that is traversable more than once.
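For instance, a hypothetical sketch (again reusing rdd from above): normalizing each partition's values by that partition's own sum needs two passes over the partition data, one to sum and one to divide, which the materialized array returned by glom makes straightforward.

val normalized = rdd.glom().flatMap { part =>
  val total = part.sum.toDouble   // first traversal: compute the partition sum
  part.map(_ / total)             // second traversal: scale each element
}
normalized.collect().foreach(println)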