Consider the following two approaches to check whether a DataFrame is empty:
df.isEmpty
df.limit(1).count == 0
I see that df.isEmpty has the following implementation:
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
}
It looks like it does more than directly counting.
What's the idea behind that groupBy? Is it just to get a DataFrame?
Why is the queryExecution plan used?
In this post, I see three different questions.
If you check the source code carefully, you can see that df.count does the same groupBy to get a RelationalGroupedDataset.
So, if we compare both implementations:
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) {
plan => plan.executeCollect().head.getLong(0) == 0
}
def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0)
}
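df.isEmpty and df.limit(1).count() == 0 behave exactly the same behind the scenes: in the implementation quoted above, isEmpty is simply limit(1) followed by the same groupBy().count() that count() uses.
However, I would go for df.isEmpty for the clarity of the name.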
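To see the two checks side by side, here is a minimal sketch. It assumes Spark 2.4 or later (where Dataset.isEmpty is available) and a local SparkSession; the object and helper names (EmptinessCheck, viaIsEmpty, viaLimitCount) are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EmptinessCheck {
  // Hypothetical helper names, used only for this illustration.
  def viaIsEmpty(df: DataFrame): Boolean = df.isEmpty
  def viaLimitCount(df: DataFrame): Boolean = df.limit(1).count() == 0

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emptiness-check")
      .getOrCreate()
    import spark.implicits._

    val empty = Seq.empty[Int].toDF("id")
    val nonEmpty = Seq(1, 2, 3).toDF("id")

    // Both checks should agree on every DataFrame.
    println(viaIsEmpty(empty) == viaLimitCount(empty))
    println(viaIsEmpty(nonEmpty) == viaLimitCount(nonEmpty))

    spark.stop()
  }
}
```

Both helpers only ever look at one row at most, so neither pays the cost of a full count on a large DataFrame.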
Why is the queryExecution plan used?
The queryExecution is an attribute that gives access to the full execution plan of the query.
Each time a transformation is applied, the resulting DataFrame carries a queryExecution that includes this transformation.
Each time an action is run, the queryExecution is retrieved, and the Catalyst optimizer optimises the plan before it is executed.
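If you want to look at these plans yourself, the public queryExecution field of a DataFrame exposes them. The following is a rough sketch, assuming a local SparkSession; describePlans and InspectPlans are hypothetical names, and the exact plan text depends on your Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InspectPlans {
  // Hypothetical helper: returns the logical, optimized and physical plans as text.
  def describePlans(df: DataFrame): Seq[String] = {
    val qe = df.queryExecution
    Seq(qe.logical.toString, qe.optimizedPlan.toString, qe.executedPlan.toString)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inspect-plans")
      .getOrCreate()

    val base = spark.range(100).toDF("id")
    // Only transformations here: the plan is built, but no job is run.
    val query = base.filter("id > 10").limit(1).groupBy().count()

    // Printing the plans does not trigger an action either.
    describePlans(query).foreach(println)

    spark.stop()
  }
}
```

Calling query.explain(true) prints a similar overview without the helper.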
What's the idea behind that groupBy?
The count method creates a RelationalGroupedDataset with a single group. This group is then populated with Literal(1) and then reduced by key (there is no key, so all rows are reduced together) concurrently to get a DataFrame with a single column called "count" and a single row containing the count. (This is why we can see a .head.getLong(0) in the df.count implementation.)
This implementation allows you to reduce concurrently in every partition instead of creating an iterator to count.
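The shape of that result can be checked directly. The sketch below assumes a local SparkSession; GroupByCount, countViaGroupBy and partialCounts are hypothetical names. The partialCounts helper is only an analogy for the "count per partition, then merge" idea, written with the RDD API, and is not the code path Spark actually executes for count.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object GroupByCount {
  // Same shape as the count implementation: one row, one Long column.
  def countViaGroupBy(df: DataFrame): Long =
    df.groupBy().count().head.getLong(0)

  // Analogy only: count each partition independently, then collect the partial results.
  def partialCounts(df: DataFrame): Array[Long] =
    df.rdd.mapPartitions(rows => Iterator(rows.size.toLong)).collect()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("groupby-count")
      .getOrCreate()

    // 1000 rows spread over 4 partitions.
    val df = spark.range(0, 1000, 1, 4).toDF("id")

    // The grouped count is itself a DataFrame with a single "count" column.
    println(df.groupBy().count().columns.mkString(","))
    println(countViaGroupBy(df))

    // Summing the per-partition counts gives the same total.
    println(partialCounts(df).sum)

    spark.stop()
  }
}
```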