DataFrame.count() == 0 vs DataFrame.rdd.isEmpty(): which executes faster?

Question

DataFrame.count() requires materializing the query, which is costly. Does DataFrame.rdd carry a non-negligible materialization cost of its own, and how does that compare with count()?

Is the latter faster to execute?

Stephen · Accepted Answer

.isEmpty() is the better choice. It's shorter and less error-prone.

Update

The Spark source explains it best. In the RDD class, isEmpty() is defined as:

def isEmpty(): Boolean = withScope {
  partitions.length == 0 || take(1).length == 0
}
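To see why that definition is cheap, here is a toy model in plain Scala (no Spark dependency). It is only a sketch: ToyRdd, rowsRead and the sample data are made-up names for illustration, and real Spark schedules jobs per partition rather than chaining iterators. The point is just that count() has to drain every partition, while take(1) stops at the first row it finds.

```scala
object IsEmptyModel {
  // Toy stand-in for an RDD: a sequence of lazily evaluated partitions.
  // This is NOT Spark's implementation, only an illustration of the idea.
  final class ToyRdd(partitions: Seq[() => Iterator[Int]]) {
    // Counts how many elements were actually pulled from the partitions.
    var rowsRead = 0

    private def read(p: () => Iterator[Int]): Iterator[Int] =
      p().map { x => rowsRead += 1; x }

    // Like count(): every element of every partition is visited.
    def count(): Long = partitions.map(p => read(p).size.toLong).sum

    // Like take(n): stops pulling as soon as n elements have been found.
    def take(n: Int): Seq[Int] =
      partitions.iterator.flatMap(read).take(n).toList

    // Same shape as the RDD definition quoted above.
    def isEmpty(): Boolean = partitions.isEmpty || take(1).isEmpty
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: 4 partitions of 1000 elements each.
    val data = Seq.fill(4)(() => (0 until 1000).iterator)

    val viaCount = new ToyRdd(data)
    println(s"count() == 0: ${viaCount.count() == 0}, rows read: ${viaCount.rowsRead}")

    val viaIsEmpty = new ToyRdd(data)
    println(s"isEmpty():    ${viaIsEmpty.isEmpty()}, rows read: ${viaIsEmpty.rowsRead}")
  }
}
```

Note that when the data really is empty, take(1) still has to look through every partition before giving up, so the saving shows up mainly on non-empty data.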

T. Gawęda · Answer

The fastest way should be the following (note that the expression is true when the dataset is non-empty):

dataset.limit(1).take(1).length > 0

This is similar to the RDD's isEmpty approach, but it avoids the deserialization that a call to .rdd requires.

However, it's hard to say whether it's better in your case, since we don't know your requirements.
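Since the answer depends on the workload, the simplest way to settle it is to measure on your own data. The following is a minimal sketch, assuming a local SparkSession and a synthetic DataFrame built with spark.range; EmptyChecks, viaCount, viaRdd, viaLimit and timed are hypothetical names introduced here, not part of Spark. Single wall-clock timings like these are rough, so treat the numbers as indicative only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptyChecks {
  // Counts every row before comparing with zero.
  def viaCount(df: DataFrame): Boolean = df.count() == 0

  // Converts to RDD[Row] first, then relies on RDD.isEmpty (take(1) internally).
  def viaRdd(df: DataFrame): Boolean = df.rdd.isEmpty()

  // Stays in the Dataset API and fetches at most one row.
  def viaLimit(df: DataFrame): Boolean = df.limit(1).take(1).isEmpty

  // Hypothetical helper: rough wall-clock timing of a single evaluation.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    val millis = (System.nanoTime() - start) / 1e6
    println(f"$label%-10s $millis%.1f ms -> $result")
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-checks")
      .getOrCreate()

    // Synthetic data standing in for a real query.
    val df = spark.range(0L, 10000000L).toDF("id")

    timed("count")(viaCount(df))
    timed("rdd")(viaRdd(df))
    timed("limit")(viaLimit(df))

    spark.stop()
  }
}
```

On Spark 2.4 and later, Dataset also provides an isEmpty method of its own, so df.isEmpty may be worth adding to the comparison if your version has it.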

Tags: scala, apache-spark, apache-spark-sql

Constantine
