I have read through the theoretical differences between map and mapPartitions, and I am fairly clear on when to use each of them in different situations.
But my problem, described below, is more about GC activity and memory (RAM). Please read on for the details:
=> I wrote a map function to convert a Row to a String, so an input RDD[org.apache.spark.sql.Row] gets mapped to an RDD[String]. With this approach, one new object is created for every row of the RDD, and creating such a large number of objects may increase GC activity.
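Roughly, the map version looks like this (a minimal sketch; the real Row-to-String logic is more involved, and df stands for an existing DataFrame):

    import org.apache.spark.sql.Row
    import org.apache.spark.rdd.RDD

    val rows: RDD[Row] = df.rdd                 // df is assumed to be an existing DataFrame
    // One String object is created per Row, which is what worries me GC-wise.
    val strings: RDD[String] = rows.map(row => row.mkString(","))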
=> To work around that, I thought of using mapPartitions, so that the number of objects becomes equivalent to the number of partitions. mapPartitions receives an Iterator as input and (in the Java API I was looking at) expects a java.lang.Iterable to be returned. But most Iterables, like Array, List, etc., live entirely in memory. So if I have a huge amount of data, could building an Iterable this way lead to an out-of-memory error? Is there some other collection (Java or Scala) that should be used here (one that spills to disk if memory starts to fill up)? Or should mapPartitions only be used when the RDD fits completely in memory?
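Here is roughly the mapPartitions variant I was considering, where the whole partition's output is collected into an in-memory List first, which is exactly the part that worries me (again just a sketch, reusing rows from above):

    // Builds the full converted partition as a List before handing it back to Spark.
    // If a partition is large, this List holds every converted row in memory at once.
    val strings2: RDD[String] = rows.mapPartitions { iter =>
      iter.map(row => row.mkString(",")).toList.iterator
    }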
Thanks in advance. Any help would be greatly appreciated.
mapPartitions() – this does the same per-element work as map(); the difference is that mapPartitions() gives you a place to do heavy initialization (for example, a database connection) once for each partition instead of once for every row.
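For instance, a connection-per-partition pattern might look roughly like this (a sketch only: the JDBC URL, the lookup table, and the RDD[Int] of ids are all made up for illustration):

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    def lookupLabels(ids: RDD[Int]): RDD[(Int, String)] =
      ids.mapPartitions { iter =>
        // Heavy initialization happens once per partition, not once per element.
        val conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")
        val stmt = conn.prepareStatement("SELECT label FROM lookup WHERE id = ?")
        iter.map { id =>
          stmt.setInt(1, id)
          val rs = stmt.executeQuery()
          val label = if (rs.next()) rs.getString(1) else "unknown"
          rs.close()
          (id, label)
        }
        // In real code you would also close conn once the iterator is exhausted.
      }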
The RDD map() transformation is used to apply operations such as adding a column, updating a column, or otherwise transforming the data; the output of a map transformation always has the same number of records as its input.
map(func): returns a new distributed dataset formed by passing each element of the source through the function func. filter(func): returns a new dataset formed by selecting those elements of the source on which func returns true.
map: returns a new RDD by applying a function to each element of the RDD; the function passed to map can return only one item per element. flatMap: similar to map, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened.
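A quick toy illustration of the map/flatMap difference (my own example; sc is the usual SparkContext):

    val lines = sc.parallelize(Seq("a b", "c"))

    // map: exactly one output element per input element -> RDD[Array[String]] with 2 elements
    val mapped = lines.map(_.split(" "))

    // flatMap: each input element may produce zero or more outputs, and the results
    // are flattened -> RDD[String] containing "a", "b", "c"
    val flattened = lines.flatMap(_.split(" "))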
If you think about JavaRDD.mapPartitions, it takes a FlatMapFunction (or some variant like DoubleFlatMapFunction), which is expected to return an Iterator, not an Iterable. If the underlying collection is lazy then you have nothing to worry about. RDD.mapPartitions takes a function from Iterator to Iterator.
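So on the Scala side something like the following stays lazy end to end, since Iterator.map only wraps the source iterator (my sketch, reusing the Row-to-String conversion from the question, where rows is an RDD[Row]):

    // Each row is converted only when the downstream consumer pulls it;
    // no intermediate collection for the partition is ever built.
    val asStrings = rows.mapPartitions { iter =>
      iter.map(row => row.mkString(","))
    }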
In general, if you are working with reference data you can replace mapPartitions with map and use a static member to store the data. This will have the same footprint and will be easier to write.
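What that static-member idea could look like in practice (a sketch; the singleton object, the lookup map, and codesRdd are all hypothetical):

    // A singleton object is initialized once per JVM (i.e. once per executor),
    // so a plain map() has the same footprint as doing the setup in mapPartitions().
    object CountryNames {
      lazy val byCode: Map[String, String] =
        Map("US" -> "United States", "DE" -> "Germany")   // stand-in reference data
    }

    val names = codesRdd.map(code => CountryNames.byCode.getOrElse(code, "unknown"))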
To answer your question about mapPartitions(f: Iterator => Iterator): it is lazy and does not hold the whole partition in memory. Spark takes this Iterator => Iterator function (you can think of it as a functor in FP terms) and compiles it into its own execution code. If a partition is too big, it will be spilled to disk before the next shuffle point, so don't worry about it.
One thing worth mentioning: you can force your function to materialize the data in memory, simply by doing:
    rdd.mapPartitions(
      partitionIter => {
        // doYourLogic stands in for your per-element transformation
        partitionIter.map(doYourLogic).toList.toIterator
      }
    )
toList will force Spark to materialize the data for the whole partition in memory, so watch out for this, because operations like toList will break the laziness of the function chain.
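By contrast, dropping the toList keeps the chain lazy, so no more than one element of the partition is in flight at a time (same placeholder logic as above):

    rdd.mapPartitions(
      partitionIter => {
        // Iterator.map is lazy: elements are transformed one at a time as they
        // are consumed, so the partition is never materialized as a collection.
        partitionIter.map(doYourLogic)
      }
    )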