
Which method is better for checking whether a DataFrame is empty: `df.limit(1).count == 0` or `df.isEmpty`?

Consider the following two approaches to checking whether a DataFrame is empty:

  1. df.isEmpty
  2. df.limit(1).count == 0

I see df.isEmpty has the following implementation:

  def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

It looks like this does more than count directly.

What's the idea behind that groupBy? Is it just to get a dataframe?

Why is the queryExecution plan used?

apnith asked Feb 03 '23 17:02


1 Answer

In this post, I see three different questions.

Performance

If you check the source code carefully, you can see that df.count performs the same groupBy to get a RelationalGroupedDataset.

So, if we compare both implementations:

def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { 
  plan => plan.executeCollect().head.getLong(0) == 0
}

def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}

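Since count() is itself groupBy().count(), df.limit(1).count() builds the same limit(1).groupBy().count() plan that df.isEmpty does. df.isEmpty and df.limit(1).count() == 0 therefore behave exactly the same way behind the scenes.

However, I would go for df.isEmpty because its name states the intent clearly.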
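To check the equivalence yourself, you can run both forms side by side. This is only a minimal sketch: it assumes Spark 2.4 or later (where Dataset.isEmpty exists) on the classpath and a local session, and the object and method names are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object EmptinessChecks {
  // Approach 1 from the question.
  def viaIsEmpty(df: DataFrame): Boolean = df.isEmpty

  // Approach 2 from the question.
  def viaLimitCount(df: DataFrame): Boolean = df.limit(1).count() == 0

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emptiness-checks")
      .getOrCreate()

    val empty = spark.emptyDataFrame
    val nonEmpty = spark.range(5).toDF("id")

    // Both approaches should agree on each DataFrame.
    println(viaIsEmpty(empty) == viaLimitCount(empty))
    println(viaIsEmpty(nonEmpty) == viaLimitCount(nonEmpty))

    spark.stop()
  }
}
```

Calling explain(true) on df.limit(1).groupBy().count() is another way to look at the plan that both forms end up executing.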

Why is the queryExecution plan used?

The queryExecution is an attribute of the Dataset that holds the complete execution plan for the query.

Each time a transformation is applied, the resulting Dataset carries a queryExecution updated to include that transformation.

Each time an action is called, the queryExecution is retrieved, and the Catalyst optimizer optimises the plan before it is executed.
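You can inspect these plans directly on any DataFrame. The sketch below assumes a local Spark session; PlanInspection and stages are hypothetical names chosen for illustration, while logical, optimizedPlan and executedPlan are members of Spark's QueryExecution.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object PlanInspection {
  // Renders the plan at three stages: as written, after Catalyst
  // optimisation, and as the physical plan that would be executed.
  def stages(df: DataFrame): Seq[(String, String)] = {
    val qe = df.queryExecution
    Seq(
      "logical"   -> qe.logical.toString,
      "optimized" -> qe.optimizedPlan.toString,
      "physical"  -> qe.executedPlan.toString
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-inspection")
      .getOrCreate()

    // Each transformation returns a new DataFrame whose plan includes it.
    val df = spark.range(100).toDF("id").filter("id % 2 = 0").limit(1)

    stages(df).foreach { case (name, plan) =>
      println(s"== $name ==\n$plan")
    }

    spark.stop()
  }
}
```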

What's the idea behind that groupBy?

The count method creates a RelationalGroupedDataset with a single group. Each row is then mapped to Literal(1), and these values are reduced by key concurrently (there is no key, so all rows fall into the same group). The result is a DataFrame with a single column called "count" and a single row containing the count. (This is why the df.count implementation ends with .head.getLong(0).)

This implementation lets the reduction run concurrently in every partition instead of creating an iterator to count.
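The idea of reducing per partition and then merging can be illustrated without Spark. The following is a plain-Scala sketch, not Spark's actual code: each inner Seq stands in for one partition, and PartialCount is a made-up name.

```scala
// Hypothetical stand-in for a partitioned dataset: one inner Seq per partition.
object PartialCount {
  def count[A](partitions: Seq[Seq[A]]): Long = {
    // Step 1: within each partition, treat every row as the literal 1
    // and sum locally. In Spark these partial sums are computed in parallel.
    val partials: Seq[Long] =
      partitions.map(_.foldLeft(0L)((acc, _) => acc + 1L))

    // Step 2: merge the partial sums into one total (the single "count" row).
    partials.sum
  }

  // Mirrors limit(1) followed by count: keep at most one row, then count it.
  def isEmpty[A](partitions: Seq[Seq[A]]): Boolean = {
    val atMostOne: Seq[A] = partitions.iterator.flatten.take(1).toSeq
    count(Seq(atMostOne)) == 0L
  }
}
```

For example, PartialCount.count(Seq(Seq(1, 2), Seq(), Seq(3))) sums the partial counts 2, 0 and 1.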

BlueSheepToken answered Feb 07 '23 00:02