I have an rdd which contains key value pairs. There are just 3 keys, and I would like to write all the elements for a given key to a textfile. Currently I am doing this in 3 passes, but I wanted to see if I could do it in one pass.
Here is what I have so far:
# I have an rdd (called my_rdd) such that a record is a key value pair, e.g.:
# ('data_set_1','value1,value2,value3,...,value100')
my_rdd.cache()
my_keys = ['data_set_1','data_set_2','data_set_3']
for key in my_keys:
my_rdd.filter(lambda l: l[0] == key).map(lambda l: l[1]).saveAsTextFile(my_path+'/'+key)
this works, however caching it and iterating through three times can be a lengthy process. I am wondering if there is any way to simultaneously write all three files?
Alternative approach by using customized Partitioner(which partition your dataset before writing to output file, compared to the approach provided by Def_Os)
For Example:RDD[(K, W)].partitionBy(partitioner: Partitioner)
class CustmozedPartitioner extends Partitioner {
override def numPartitions: Int = 4
override def getPartition(key: Any): Int = {
key match {
case "data_set_1" => 0
case "data_set_2" => 1
case "data_set_3" => 2
case _ => 3
}
}
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With