DataFrame.count() requires materializing the query which is costly. Is there a non-negligible cost [of materialization] to DataFrame.rdd and how does that compare to the former?
Is the latter faster to execute?
.isEmpty() is best. It's shorter and less error-prone.
The Spark source explains it best. In the RDD class, isEmpty() is defined as:
def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
}
The fastest way should be:
dataset.limit(1).take(1).length > 0
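This is similar to RDD's isEmpty, but it does not require the deserialization that a call to .rdd incurs. Note that the expression is true when the dataset has at least one row, so negate it for an emptiness check.
However, it's hard to say whether it is better in your case, since we don't know your requirements.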
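For comparison, here is a minimal sketch of the limit/take check wrapped in a helper, next to the RDD-based check. The object name EmptyCheck, the helper hasNoRows, and the sample data are made up for illustration, and the local SparkSession setup is an assumption rather than part of the original answers.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyCheck {
  // Hypothetical helper: fetches at most one row and reports
  // whether nothing came back. Stays on the DataFrame API.
  def hasNoRows(df: DataFrame): Boolean =
    df.limit(1).take(1).isEmpty

  def main(args: Array[String]): Unit = {
    // Assumed local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-check")
      .getOrCreate()
    import spark.implicits._

    val nonEmpty = Seq(1, 2, 3).toDF("n")
    val empty = nonEmpty.filter($"n" > 10) // no row satisfies n > 10

    println(hasNoRows(nonEmpty)) // false
    println(hasNoRows(empty))    // true

    // RDD-based alternative from the earlier answer; converting
    // with .rdd turns internal rows into Row objects first.
    println(empty.rdd.isEmpty()) // true

    spark.stop()
  }
}
```

If you are on Spark 2.4 or later, Dataset also has its own isEmpty method, which may make a helper like this unnecessary. Which variant is actually faster depends on the query plan and data source, so it is worth timing on your own workload.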