 

Spark: Efficient way to test if an RDD is empty

There is no isEmpty method on RDDs, so what is the most efficient way to test whether an RDD is empty?

asked Feb 11 '15 by Tobber


1 Answer

RDD.isEmpty() will be part of Spark 1.3.0.
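For reference, on Spark 1.3.0 and later you can call the built-in method directly; a minimal sketch:

// Spark 1.3.0+: isEmpty() is built into RDD
val rdd = sc.parallelize(Seq.empty[Int])
rdd.isEmpty()   // true

For earlier versions, the workarounds below apply.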

Based on suggestions in this Apache mail thread and later some comments on this answer, I have done some small local experiments. The best method is take(1).length == 0:

def isEmpty[T](rdd: RDD[T]): Boolean = {
  rdd.take(1).length == 0
}

It should run in O(1), except when the RDD is empty, in which case it is linear in the number of partitions: take(1) scans partitions incrementally until it finds an element, so it normally stops after the first non-empty partition, but on an empty RDD it has to visit every partition.

Thanks to Josh Rosen and Nick Chammas for pointing me to this.

Note: this fails if the RDD is of type RDD[Nothing], e.g. isEmpty(sc.parallelize(Seq())), but this is unlikely to be a problem in practice. isEmpty(sc.parallelize(Seq[Any]())) works fine.
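If you do hit the RDD[Nothing] case, giving the empty Seq an explicit element type avoids it; a minimal sketch:

// Seq() infers Seq[Nothing]; annotate the element type instead
val rdd = sc.parallelize(Seq.empty[String])   // RDD[String]
isEmpty(rdd)                                  // true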


Edits:

  • Edit 1: Added the take(1).length == 0 method, thanks to comments.

My original suggestion: Use mapPartitions.

def isEmpty[T](rdd: RDD[T]): Boolean = {
  rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_ && _)
}

It scales with the number of partitions and is not nearly as clean as take(1). It is, however, robust to RDDs of type RDD[Nothing].
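As a quick check of that robustness, using the helper above:

// Works even though Seq() gives an RDD[Nothing]
isEmpty(sc.parallelize(Seq()))   // true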


Experiments:

I used this code for the timings.

def time(n: Long, f: (RDD[Long]) => Boolean): Unit = {
  val start = System.currentTimeMillis()
  val rdd = sc.parallelize(1L to n, numSlices = 100)
  val result = f(rdd)
  printf("Time: " + (System.currentTimeMillis() - start) + "   Result: " + result)
}

time(1000000000L, rdd => rdd.take(1).length == 0L)
time(1000000000L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(1000000000L, rdd => rdd.count() == 0L)
time(1000000000L, rdd => rdd.takeSample(true, 1).isEmpty)
time(1000000000L, rdd => rdd.fold(0)(_ + _) == 0L)

time(1L, rdd => rdd.take(1).length == 0L)
time(1L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(1L, rdd => rdd.count() == 0L)
time(1L, rdd => rdd.takeSample(true, 1).isEmpty)
time(1L, rdd => rdd.fold(0)(_ + _) == 0L)

time(0L, rdd => rdd.take(1).length == 0L)
time(0L, rdd => rdd.mapPartitions(it => Iterator(!it.hasNext)).reduce(_&&_))
time(0L, rdd => rdd.count() == 0L)
time(0L, rdd => rdd.takeSample(true, 1).isEmpty)
time(0L, rdd => rdd.fold(0)(_ + _) == 0L)

On my local machine with 3 worker cores I got these results:

Time:    21   Result: false
Time:    75   Result: false
Time:  8664   Result: false
Time: 18266   Result: false
Time: 23836   Result: false

Time:   113   Result: false
Time:   101   Result: false
Time:    68   Result: false
Time:   221   Result: false
Time:    46   Result: false

Time:    79   Result: true
Time:    93   Result: true
Time:    79   Result: true
Time:   100   Result: true
Time:    64   Result: true
answered by Tobber