What is the faster way to count the number of entries in a data frame?

Question

I have a data frame df that contains around 1 GB of data. Why does df.count() take a relatively long time to complete, while df.filter(...) is much faster? Is there a way to estimate the number of entries in df that is faster than df.count()?

Harald Gliebe · Accepted Answer

df.count() is the correct way. Note that df.filter(...) is a transformation, which means it is lazy: calling it only records the filter in the query plan, and no data is processed yet. The filter is executed only when you apply an action such as count or collect to the filtered result. At that point the runtime should be similar to that of the original call to count.
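To see the difference for yourself, here is a minimal sketch that times the three calls separately. The local SparkSession, the generated `id` column, the row count, and the `timed` helper are all assumptions made for illustration; they are not from the question, and the actual timings will depend on your data and cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LazyCountSketch {
  // Hypothetical helper: evaluates fn and returns its result together
  // with the elapsed wall-clock time in milliseconds.
  def timed[A](fn: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = fn
    (result, (System.nanoTime() - start) / 1000000)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the demonstration.
    val spark = SparkSession.builder()
      .appName("lazy-count-sketch")
      .master("local[*]")
      .getOrCreate()

    // Stand-in for the df in the question: a single generated "id" column.
    val df: DataFrame = spark.range(0L, 10000000L).toDF("id")

    // Transformation: only extends the query plan, no job is started.
    val (filtered, filterMs) = timed(df.filter(col("id") % 2 === 0))

    // Actions: each one starts a job that actually processes the data.
    val (total, countMs) = timed(df.count())
    val (evens, filteredCountMs) = timed(filtered.count())

    println(s"filter (lazy):         $filterMs ms")
    println(s"count:                 $countMs ms, rows = $total")
    println(s"filter + count:        $filteredCountMs ms, rows = $evens")

    spark.stop()
  }
}
```

The time reported for the filter line covers only building the plan, which is why it looks so much cheaper than count in isolation. The cost of the filter shows up in the third measurement, once an action forces it to run.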

Tags: scala, apache-spark, apache-spark-sql

Dinosaurius
