I have a data frame df that contains around 1 GB of data. Why does df.count() take a relatively long time to complete, while df.filter(...) is much faster? Is there a faster way than df.count() to estimate the number of entries in df?
df.count() is the correct way.
Note that df.filter(...) is a transformation, which means it is lazy: the filtering code is not executed when you call it. It only runs once you add an action such as count or collect to the filtered result, and at that point the runtime should be similar to the original call to count.