
DataFrame.count() == 0 Vs DataFrame.rdd.isEmpty(): please compare for execution speed

DataFrame.count() requires materializing the query, which is costly. Does DataFrame.rdd also carry a non-negligible materialization cost, and how does it compare to count()?

Is the latter faster to execute?

Constantine asked Apr 23 '26 04:04


2 Answers

.isEmpty() is best. It's shorter and less error-prone.

Update

The Spark source explains it best. In the RDD class, isEmpty() is defined as:

def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
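
To see why this definition is cheaper than count() == 0, here is a minimal sketch in plain Scala. It is a toy model, not Spark code: ToyRdd, rowsRead and take1 are hypothetical names invented for the illustration, and partitions are modeled as lazy iterators. Real Spark's take(1) runs jobs over a growing number of partitions, which this sketch simplifies to scanning them in order.

```scala
// Toy model only: these are NOT Spark classes. Partitions are plain lazy iterators.
object IsEmptySketch {
  // Hypothetical stand-in for an RDD: a list of partitions, each producing rows lazily.
  final class ToyRdd(partitions: Seq[() => Iterator[Int]]) {
    var rowsRead = 0L // how many rows have been pulled from the partitions so far

    // Open a partition and count every row that is actually read from it.
    private def open(p: () => Iterator[Int]): Iterator[Int] =
      p().map { row => rowsRead += 1; row }

    // Like count(): every row of every partition must be visited.
    def count(): Long = partitions.map(p => open(p).size.toLong).sum

    // Like take(1): stop at the first row found, scanning partitions in order.
    def take1(): List[Int] = partitions.iterator.flatMap(open).take(1).toList

    // Same shape as the RDD.isEmpty() definition quoted above.
    def isEmpty: Boolean = partitions.isEmpty || take1().isEmpty
  }

  def main(args: Array[String]): Unit = {
    // A fresh 4-partition data set with 1000 rows per partition on each call.
    def data = new ToyRdd(Seq.fill(4)(() => Iterator.range(0, 1000)))

    val viaCount = data
    println(s"count == 0: ${viaCount.count() == 0}, rows read: ${viaCount.rowsRead}")

    val viaIsEmpty = data
    println(s"isEmpty: ${viaIsEmpty.isEmpty}, rows read: ${viaIsEmpty.rowsRead}")
  }
}
```

In this model the count-based check reads every row, while isEmpty stops after the first one. The gap in real Spark depends on the query plan that produces the rows, so treat this as an intuition aid rather than a benchmark.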
Stephen answered Apr 25 '26 20:04


The fastest way should be:

dataset.limit(1).take(1).length > 0

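This is a similar approach to RDD's isEmpty, but it avoids the deserialization that a call to .rdd requires. Note that the expression is true when the dataset is non-empty; negate it to test for emptiness.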

However, it's hard to say whether it's better in your case, since we don't know your requirements.
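
For anyone who wants to compare the options on their own data, here is a rough sketch that wraps the three checks discussed in this thread as helpers. It assumes Spark SQL is on the classpath and uses a local session. The object name, helper names and sample data are made up for illustration, and viaLimit uses == 0 so that it answers "is it empty?" like the other two.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptinessChecks {
  // Counts every row, then compares with zero.
  def viaCount(df: DataFrame): Boolean = df.count() == 0

  // Converts to an RDD of Row objects, then uses RDD.isEmpty().
  def viaRdd(df: DataFrame): Boolean = df.rdd.isEmpty()

  // Asks for at most one row without going through .rdd.
  def viaLimit(df: DataFrame): Boolean = df.limit(1).take(1).length == 0

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("emptiness-checks")
      .master("local[*]") // assumption: local run, adjust for a cluster
      .getOrCreate()

    // Illustrative data: one non-empty DataFrame and one filtered down to nothing.
    val nonEmpty = spark.range(0, 1000000).toDF("id")
    val empty = nonEmpty.filter("id < 0")

    for ((name, df) <- Seq("nonEmpty" -> nonEmpty, "empty" -> empty)) {
      println(s"$name: count=${viaCount(df)} rdd=${viaRdd(df)} limit=${viaLimit(df)}")
    }

    spark.stop()
  }
}
```

All three helpers should agree on the answer; only the amount of work differs, and that depends on the plan behind the DataFrame. To measure it, time each helper on a realistic query rather than on spark.range. If your Spark version is 2.4 or later, Dataset.isEmpty is also available as a built-in check.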

T. Gawęda answered Apr 25 '26 19:04