I've come across the glom() method on RDD. As per the documentation:

Return an RDD created by coalescing all elements within each partition into an array.

Does glom() shuffle the data across the partitions, or does it only return the partition data as an array? In the latter case, I believe that the same can be achieved using mapPartitions().

I would also like to know if there are any use cases that benefit from glom().
Does glom shuffle the data across partitions?

No, it doesn't.
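A minimal sketch to illustrate, assuming a SparkContext named sc and a small example RDD (both hypothetical, not from the original post): glom() keeps the existing partitioning, so the partition count is unchanged and no shuffle is introduced.

val rdd = sc.parallelize(1 to 8, numSlices = 4)
val glommed = rdd.glom()                   // RDD[Array[Int]], one array per partition
assert(glommed.getNumPartitions == rdd.getNumPartitions)  // partitioning is preserved
glommed.collect().foreach(arr => println(arr.mkString("[", ", ", "]")))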
If this is the second case I believe that the same can be achieved using mapPartitions

It can:

rdd.mapPartitions(iter => Iterator(iter.toArray))

but the same thing applies to any non-shuffling transformation like map, flatMap or filter.
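For comparison, a sketch reusing the hypothetical rdd from above: both forms yield one array per partition with identical contents, since neither moves data between partitions.

val viaGlom = rdd.glom().collect()
val viaMapPartitions = rdd.mapPartitions(iter => Iterator(iter.toArray)).collect()
// Same partition contents in the same order
assert(viaGlom.map(_.toSeq).sameElements(viaMapPartitions.map(_.toSeq)))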
if there are any use cases which benefit from glom.
Any situation where you need to access partition data in a form that is traversable more than once.
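For instance, a hypothetical sketch (again reusing rdd from above): normalizing each partition's values by that partition's own sum needs two passes over the partition data, one to sum and one to divide, which the materialized array returned by glom makes straightforward.

val normalized = rdd.glom().flatMap { part =>
  val total = part.sum.toDouble   // first traversal: compute the partition sum
  part.map(_ / total)             // second traversal: scale each element
}
normalized.collect().foreach(println)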