Method 1: isEmpty() — the isEmpty function of the DataFrame or Dataset returns true when the DataFrame is empty and false when it is not. Note that if the DataFrame reference itself is null, invoking isEmpty results in a NullPointerException, and calling df.head() or df.first() on an empty DataFrame throws java.util.NoSuchElementException.
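For reference, a minimal usage sketch (assuming Spark 2.4+ where Dataset.isEmpty is available; the local SparkSession is just for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("empty-check").getOrCreate()
import spark.implicits._

// An empty DataFrame with a schema, just for demonstration
val df = Seq.empty[(String, Int)].toDF("name", "age")

if (df.isEmpty) println("DataFrame is empty") else println("DataFrame has rows")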
For Spark 2.1.0, my suggestion would be to use head(n: Int) or take(n: Int) with isEmpty, whichever one has the clearest intent to you.
df.head(1).isEmpty
df.take(1).isEmpty
with the Python equivalents:
len(df.head(1)) == 0  # or: not df.head(1)
len(df.take(1)) == 0  # or: not df.take(1)
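In Scala, putting the null check and the row check together, a small helper along these lines can be handy (a sketch; the name isEmptyDf is just illustrative):

import org.apache.spark.sql.DataFrame

// True when the reference is null or the DataFrame has no rows.
// head(1) pulls at most one row, so this avoids a full count.
def isEmptyDf(df: DataFrame): Boolean =
  df == null || df.head(1).isEmpty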
Using df.first() and df.head() will both throw java.util.NoSuchElementException if the DataFrame is empty. first() calls head() directly, which calls head(1).head.

def first(): T = head()
def head(): T = head(1).head

head(1) returns an Array, so taking head on that Array causes the java.util.NoSuchElementException when the DataFrame is empty.
def head(n: Int): Array[T] = withAction("head", limit(n).queryExecution)(collectFromPlan)
So instead of calling head(), use head(1) directly to get the array and then you can use isEmpty.

take(n) is also equivalent to head(n)...

def take(n: Int): Array[T] = head(n)

And limit(1).collect() is equivalent to head(1) (notice limit(n).queryExecution in the head(n: Int) method), so the following are all equivalent, at least from what I can tell, and you won't have to catch a java.util.NoSuchElementException when the DataFrame is empty.
df.head(1).isEmpty
df.take(1).isEmpty
df.limit(1).collect().isEmpty
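To see the difference concretely, a quick sketch against an empty DataFrame (assuming an existing SparkSession named spark):

import scala.util.Try

val emptyDf = spark.emptyDataFrame

Try(emptyDf.first()).isFailure       // true: throws java.util.NoSuchElementException
Try(emptyDf.head()).isFailure        // true: same exception
emptyDf.head(1).isEmpty              // true, no exception
emptyDf.take(1).isEmpty              // true, no exception
emptyDf.limit(1).collect().isEmpty   // true, no exception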
I know this is an older question so hopefully it will help someone using a newer version of Spark.
I would say to just grab the underlying RDD. In Scala:
df.rdd.isEmpty
in Python:
df.rdd.isEmpty()
That being said, all this does is call take(1).length, so it'll do the same thing as Rohan answered... just maybe slightly more explicit?
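A quick check on an empty DataFrame shows the same behavior (again assuming an existing SparkSession named spark):

val emptyDf = spark.emptyDataFrame

emptyDf.rdd.isEmpty()        // true
emptyDf.rdd.take(1).isEmpty  // true: effectively the check isEmpty performs under the hood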
I had the same question, and I tested 3 main solutions:

1. (df != null) && (df.count > 0)
2. df.head(1).isEmpty, as @hulin003 suggests
3. df.rdd.isEmpty(), as @Justin Pihony suggests

Of course all 3 work; however, in terms of execution time when running these methods on the same DF on my machine, df.rdd.isEmpty() came out ahead. Therefore I think that the best solution is df.rdd.isEmpty(), as @Justin Pihony suggests.
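If you want to reproduce that kind of comparison on your own data, a rough timing sketch (assuming an existing DataFrame df; the numbers will vary with data size and cluster):

// Crude wall-clock timing helper, enough for a ballpark comparison
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time("count > 0")       { df != null && df.count > 0 }
time("head(1).isEmpty") { df.head(1).isEmpty }
time("rdd.isEmpty")     { df.rdd.isEmpty() }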
Since Spark 2.4.0 there is Dataset.isEmpty.

Its implementation is:
def isEmpty: Boolean =
withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
}
Note that a DataFrame is no longer a class in Scala, it's just a type alias (probably changed with Spark 2.0):
type DataFrame = Dataset[Row]
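In practice that means a DataFrame can be passed anywhere a Dataset[Row] is expected, and Dataset.isEmpty works on both. A small sketch (assuming an existing SparkSession named spark):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// DataFrame is just an alias for Dataset[Row], so this accepts either.
def hasNoRows(ds: Dataset[Row]): Boolean = ds.isEmpty

val df: DataFrame = spark.range(0).toDF("id")
hasNoRows(df)   // true: the range is empty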