Spark: persist and repartition order

I have the following code:

val data = input.map{... }.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000)

I am wondering what's the difference if I do the repartition first like:

val data = input.map{... }.repartition(2000).persist(StorageLevel.MEMORY_ONLY_SER)

Is there a difference depending on the order in which repartition and persist are called? Thanks!

asked Nov 12 '15 23:11

Edamame

1 Answer

Yes, there is a difference.

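In the first case you persist the RDD produced by the map phase; the repartitioned RDD built on top of it is not cached. It means that every time data is accessed it will trigger the repartition again, starting from the cached map output.

In the second case you cache after repartitioning. Once data has been materialized, later accesses read the cached, already repartitioned partitions, and there is no additional work to do.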
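The distinction can be checked without running any job, because persist only marks the RDD it is called on and repartition returns a new RDD. The following is a minimal sketch, assuming a local Spark installation; the app name and the names mapped, first and second are made up for illustration, and SparkContext.getOrCreate is used so that an existing context (for example in spark-shell) is reused.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local mode; in spark-shell the existing context is returned instead.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-order-sketch"))

// Hypothetical stand-in for input.map{... }
val mapped = sc.parallelize(1 to 10, 8).map(identity)

// Order 1: persist marks `mapped`; repartition then builds a new RDD on top of it.
val first = mapped.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000)

// Order 2: the repartitioned RDD itself is the one marked for caching.
val second = sc.parallelize(1 to 10, 8)
  .map(identity)
  .repartition(2000)
  .persist(StorageLevel.MEMORY_ONLY_SER)

println(s"mapped: ${mapped.getStorageLevel.description}")
println(s"first:  ${first.getStorageLevel.description}")
println(s"second: ${second.getStorageLevel.description}")
```

Here first should report no storage level of its own, while mapped and second carry the serialized in-memory level.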

To demonstrate, let's run an experiment:

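import org.apache.spark.storage.StorageLevel

// Case 1: persist the map output, then repartition.
val data1 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .persist(StorageLevel.MEMORY_ONLY_SER)
  .repartition(2000)
data1.count()  // action that materializes the persisted RDD

// Case 2: repartition first, then persist the result.
val data2 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .repartition(2000)
  .persist(StorageLevel.MEMORY_ONLY_SER)
data2.count()  // action that materializes the persisted RDD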

and take a look at the storage info:

sc.getRDDStorageInfo

// Array[org.apache.spark.storage.RDDInfo] = Array(
//   RDD "MapPartitionsRDD" (17) StorageLevel:
//       StorageLevel(false, true, false, false, 1);
//     CachedPartitions: 2000; TotalPartitions: 2000; MemorySize: 8.6 KB; 
//     ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B,
//   RDD "MapPartitionsRDD" (7) StorageLevel:
//      StorageLevel(false, true, false, false, 1);
//    CachedPartitions: 8; TotalPartitions: 8; MemorySize: 668.0 B; 
//    ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)

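As you can see, there are two persisted RDDs: one with 2000 partitions (data2, persisted after the repartition) and one with 8 (the map output behind data1, persisted before the repartition).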
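The same conclusion can be cross-checked against the context's registry of RDDs marked as persistent, which does not depend on the blocks having been materialized yet. This is a rough sketch under the same assumptions as the experiment (a local context, obtained here via SparkContext.getOrCreate so it also runs outside spark-shell); the app name and the persisted map are illustrative names, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local mode; in spark-shell the existing context is returned instead.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("persist-registry-sketch"))

val data1 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .persist(StorageLevel.MEMORY_ONLY_SER)
  .repartition(2000)

val data2 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .repartition(2000)
  .persist(StorageLevel.MEMORY_ONLY_SER)

// RDD id -> partition count for every RDD currently marked as persistent.
val persisted: Map[Int, Int] =
  sc.getPersistentRDDs.map { case (id, rdd) => id -> rdd.partitions.length }.toMap

println(s"data2 registered: ${persisted.get(data2.id)}")
println(s"data1 registered: ${persisted.contains(data1.id)}")
```

Only data2 itself appears in the registry; for data1 the registered RDD is its 8-partition parent, which is why each access to data1 has to go through the repartition step.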

answered Sep 24 '22 23:09

zero323

Tags: persist, apache-spark, rdd, partition
