In Spark, is counting the records in an RDD an expensive task?

In Hadoop, when I use an InputFormat reader, the job-level logs report how many records were read, along with the byte count, etc.

In Spark, when I use the same InputFormat reader, I get none of those metrics.

So I'm thinking I would use the InputFormat reader to populate the RDD, and then just publish the number of records in the RDD (the size of the RDD).
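As a sketch of that plan (assuming a text-based InputFormat; the input path and app name are placeholders):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("record-count"))

// Load through the same Hadoop InputFormat the MapReduce job used.
val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs:///path/to/input")  // hypothetical input path

// Publish the record count that Hadoop would have reported in its job logs.
println(s"records read: ${rdd.count()}")
```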

I know that rdd.count() returns the number of elements in the RDD.

However, the cost of using count() is not clear to me. For example:

  • Is it a distributed function? Will each partition report its count, and the counts are summed and reported? Or is the entire RDD brought to the driver and counted?
  • After executing count(), will the RDD still remain in memory, or do I have to explicitly cache it?
  • Is there a better way to do what I want to do, namely count the records before operating on them?

1 Answer

Is it a distributed function? Will each partition report its count, and the counts are summed and reported? Or is the entire RDD brought to the driver and counted?

Count is distributed. In Spark nomenclature, count is an "action". All actions are distributed. Really, there are only a handful of operations that bring all of the data to the driver node, and they are generally well documented (e.g. take, collect, etc.).
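For example, a sketch contrasting the two (reusing the rdd from the question):

```scala
// Distributed: each partition is counted on its executor, and only the
// per-partition tallies travel to the driver to be summed.
val n = rdd.count()

// Not so collect(): it pulls every record to the driver, so it should be
// avoided on large RDDs.
val everything = rdd.collect()
```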

After executing count(), will the RDD still remain in memory, or do I have to explicitly cache it?

No, the data will not stay in memory. If you want it to, you need to explicitly cache the RDD before counting. Spark's lazy evaluation will not perform any computation until an action is taken, and no data is kept in memory after an action unless there was a cache call.

Is there a better way to do what I want to do, namely count the records before operating on them?

Cache, count, then operate seems like a solid plan.
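In code, that plan might look like this (a sketch; processRecord and the output path are hypothetical):

```scala
// Mark the RDD for caching so the count and the later work share one read.
rdd.cache()

// First action: materializes the partitions and stores them in memory.
val total = rdd.count()
println(s"operating on $total records")

// Later actions reuse the cached partitions instead of re-reading the input.
rdd.map(processRecord)                        // processRecord is hypothetical
   .saveAsTextFile("hdfs:///path/to/output")  // hypothetical output path
```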
