Which method is better to check if a dataframe is empty ? `df.limit(1).count == 0` or `df.isEmpty`?

Question

For the below 2 approaches to check if a dataframe is empty:

df.isEmpty
df.limit(1).count == 0

I see df.isEmpty has the following implementation:

  def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }

Looks like it does more than directly counting.

What's the idea behind that groupBy? Is it just to get a dataframe?

Why is the queryExecution plan used?

BlueSheepToken · Accepted Answer

In this post, I see three different questions.

Performances

If you check carefully the source code, you can see df.count does the same groupBy to get a RelationalGroupedDateset

So, if we compare both implementations :

def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { 
  plan => plan.executeCollect().head.getLong(0) == 0
}

def count(): Long = withAction("count", groupBy().count().queryExecution) { plan =>
  plan.executeCollect().head.getLong(0)
}

df.isEmpty and df.limit(1).count() == 0 are acting exactly the same behind the scenes.

However, I would go for df.isEmpty for the clarity of the name.

Why is the `queryExecution` plan used ?

The query Execution plan is an attribute which is needed to have the global execution plan.

Each time a transformation is done, the queryExecution will be upgraded with this transformation.

Each time an action is done, the queryExecution is retrieved, and the Catalyst plan optimises it.

What's the idea behind that `groupBy` ?

The count method creates a RelationalGroupedDataset with a single group. This group is then populate with Literal(1) and then reduce by key (it contains no key, so it reduces every columns) concurrently to get a DataFrame with a single column called "count" with only one row containing the count. (This is why in the df.count implementation we can see a .head.getLong(0)

This implementation allows you to reduce concurrently in every partitions instead of creating an iterator to count.

Which method is better to check if a dataframe is empty ? `df.limit(1).count == 0` or `df.isEmpty`?

Tags:

scala

apache-spark

apache-spark-sql

apnith

1 Answers

Performances

Why is the `queryExecution` plan used ?

What's the idea behind that `groupBy` ?

BlueSheepToken

Recent Activity

Donate For Us

Which method is better to check if a dataframe is empty ? `df.limit(1).count == 0` or `df.isEmpty`?

Tags:

scala

apache-spark

apache-spark-sql

apnith

1 Answers

Performances

Why is the queryExecution plan used ?

What's the idea behind that groupBy ?

BlueSheepToken

Related questions

Recent Activity

Donate For Us

Why is the `queryExecution` plan used ?

What's the idea behind that `groupBy` ?