Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark: persist and repartition order

I have the following code:

val data = input.map{... }.persist(StorageLevel.MEMORY_ONLY_SER).repartition(2000)

I am wondering what's the difference if I do the repartition first like:

val data = input.map{... }.repartition(2000).persist(StorageLevel.MEMORY_ONLY_SER)

Is there a difference in the order of calling reparation and persist? Thanks!

like image 368
Edamame Avatar asked Nov 12 '15 23:11

Edamame


People also ask

Which is faster repartition or coalesce?

Repartition works by creating new partitions and doing a full shuffle to move data around. Results in more or less equal sized partitions. Since a full shuffle takes place, repartition is less performant than coalesce.

Is Spark persist lazy?

Caching or persisting of Spark DataFrame or Dataset is a lazy operation, meaning a DataFrame will not be cached until you trigger an action.

What does persist () do in Spark?

Spark RDD persistence is an optimization technique which saves the result of RDD evaluation in cache memory. Using this we save the intermediate result so that we can use it further if required. It reduces the computation overhead.


1 Answers

Yes, there is a difference.

In the first case you get persist RDD after map phase. It means that every time data is accessed it will trigger repartition.

In the second case you cache after repartitioning. When data is accessed, and has been previously materialized, there is no additional work to do.

To prove lets make an experiment:

import  org.apache.spark.storage.StorageLevel

val data1 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .persist(StorageLevel.MEMORY_ONLY_SER)
  .repartition(2000)
data1.count()

val data2 = sc.parallelize(1 to 10, 8)
  .map(identity)
  .repartition(2000)
  .persist(StorageLevel.MEMORY_ONLY_SER)
data2.count()

and take a look at the storage info:

sc.getRDDStorageInfo

// Array[org.apache.spark.storage.RDDInfo] = Array(
//   RDD "MapPartitionsRDD" (17) StorageLevel:
//       StorageLevel(false, true, false, false, 1);
//     CachedPartitions: 2000; TotalPartitions: 2000; MemorySize: 8.6 KB; 
//     ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B,
//   RDD "MapPartitionsRDD" (7) StorageLevel:
//      StorageLevel(false, true, false, false, 1);
//    CachedPartitions: 8; TotalPartitions: 8; MemorySize: 668.0 B; 
//    ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)

As you can see there are two persisted RDDs, one with 2000 partitions, and one with 8.

like image 144
zero323 Avatar answered Sep 24 '22 23:09

zero323