 

How to partition RDD by key in Spark?

Given that the HashPartitioner docs say:

[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.

Say I want to partition DeviceData by its kind.

case class DeviceData(kind: String, time: Long, data: String)

Would it be correct to partition an RDD[DeviceData] by overriding the hashCode() method of DeviceData to return only the hash code of kind?

But given that HashPartitioner takes a number-of-partitions parameter, I am confused: do I need to know the number of kinds in advance, and what happens if there are more kinds than partitions?
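My current reading of the docs, as a sketch (assuming the standard org.apache.spark.HashPartitioner), is that the partition index is simply the key's hash code modulo the partition count, so with more kinds than partitions several kinds would share a partition:

import org.apache.spark.HashPartitioner

// getPartition returns key.hashCode modulo numPartitions (made non-negative),
// so with more kinds than partitions, several kinds land in the same partition
val partitioner = new HashPartitioner(2)
partitioner.getPartition("thermometer")  // some index, 0 or 1
partitioner.getPartition("gps")          // may collide with "thermometer"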

Is it correct that if I write partitioned data to disk, it will stay partitioned when read back?
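To make the question concrete, here is roughly what I mean (a sketch; the path and partition count are arbitrary, and the comments reflect my understanding rather than verified behavior):

// the partitioner is a property of the in-memory RDD...
val partitioned = deviceDataRdd.keyBy(_.kind).partitionBy(new HashPartitioner(4))
partitioned.partitioner                        // Some(HashPartitioner)
partitioned.saveAsObjectFile("/tmp/devices")   // hypothetical path

// ...but a plain re-read seems to start with no partitioner: the bytes on
// disk may still be laid out per partition, yet Spark no longer knows that
val reread = sc.objectFile[(String, DeviceData)]("/tmp/devices")
reread.partitioner                             // None, as far as I can tell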

My goal is to call

  deviceDataRdd.foreachPartition { (d: Iterator[DeviceData]) => ... }

and have only DeviceData records with the same kind value in the iterator.

asked Sep 12 '15 by BAR

1 Answer

How about just doing a groupByKey using kind? Or another PairRDDFunctions method.
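For example, a minimal sketch (the sample data and app name are made up):

import org.apache.spark.{SparkConf, SparkContext}

case class DeviceData(kind: String, time: Long, data: String)

val sc = new SparkContext(new SparkConf().setAppName("kinds").setMaster("local[*]"))
val rdd = sc.parallelize(Seq(
  DeviceData("thermo", 1L, "20.1"),
  DeviceData("gps",    2L, "52.3,4.9"),
  DeviceData("thermo", 3L, "21.7")))

// groupByKey gathers every record of a kind into one Iterable,
// regardless of how the RDD happened to be partitioned beforehand
rdd.keyBy(_.kind)
   .groupByKey()
   .foreach { case (kind, records) => println(s"$kind -> ${records.size} records") }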

It sounds to me like you don't really care about the partitioning itself, just that you get all records of a specific kind in one processing flow?

The pair functions allow this:

rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition(...)
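One caveat: if there are more kinds than partitions, a single partition's iterator will interleave several kinds, so you would still need to bucket locally, something like this sketch (note that toSeq materializes the whole partition in memory):

rdd.keyBy(_.kind)
   .partitionBy(new HashPartitioner(PARTITIONS))
   .foreachPartition { iter =>
     // a partition may hold several kinds; group them locally first
     iter.toSeq.groupBy { case (kind, _) => kind }
         .foreach { case (kind, records) => /* process one kind at a time */ () }
   }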

However, you can probably be a little safer with something more like:

rdd.keyBy(_.kind).reduceByKey(...)

or mapValues, or a number of the other pair functions that guarantee you get the pieces as a whole.
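For instance, a per-kind count (a sketch; reduceByKey combines values map-side, which is generally cheaper than groupByKey):

// count records per kind without shuffling whole records
rdd.keyBy(_.kind)
   .mapValues(_ => 1L)
   .reduceByKey(_ + _)
   .collect()   // Array((kind, count), ...)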

answered Oct 12 '22 by Justin Pihony