Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to check if spark dataframe is empty?

People also ask

How do I check if a data frame is empty in spark?

Method 1: isEmpty() The isEmpty function of the DataFrame or Dataset returns true when the DataFrame is empty and false when it's not empty. If the dataframe is empty, invoking “isEmpty” might result in NullPointerException. Note : calling df. head() and df.

How do I check if a column is empty in spark?

Spark Find Count of Null, Empty String of a DataFrame Column. To find null or empty on a single column, simply use Spark DataFrame filter() with multiple conditions and apply count() action. The below example finds the number of records with null or empty for the name column.

Is DataFrame empty?

empty. True if NDFrame is entirely empty [no items], meaning any of the axes are of length 0. If NDFrame contains only NaNs, it is still not considered empty.

How do you check for null in PySpark?

In PySpark DataFrame you can calculate the count of Null, None, NaN & Empty/Blank values in a column by using isNull() of Column class & SQL functions isnan() count() and when().


For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.

df.head(1).isEmpty
df.take(1).isEmpty

with Python equivalent:

len(df.head(1)) == 0  # or bool(df.head(1))
len(df.take(1)) == 0  # or bool(df.take(1))

Using df.first() and df.head() will both return the java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.

def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)

So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException exception when the DataFrame is empty.

df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty

I know this is an older question so hopefully it will help someone using a newer version of Spark.


I would say to just grab the underlying RDD. In Scala:

df.rdd.isEmpty

in Python:

df.rdd.isEmpty()

That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered...just maybe slightly more explicit?


I had the same question, and I tested 3 main solution :

  1. (df != null) && (df.count > 0)
  2. df.head(1).isEmpty() as @hulin003 suggest
  3. df.rdd.isEmpty() as @Justin Pihony suggest

and of course the 3 works, however in term of perfermance, here is what I found, when executing the these methods on the same DF in my machine, in terme of execution time :

  1. it takes ~9366ms
  2. it takes ~5607ms
  3. it takes ~1921ms

therefore I think that the best solution is df.rdd.isEmpty() as @Justin Pihony suggest


Since Spark 2.4.0 there is Dataset.isEmpty.

It's implementation is :

def isEmpty: Boolean = 
  withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
}

Note that a DataFrame is no longer a class in Scala, it's just a type alias (probably changed with Spark 2.0):

type DataFrame = Dataset[Row]