How can I send a specific record to all my reducers?
I know the Partitioner class and what it does, but I don't see any easy way of making sure a record goes to all the reducers.
Basically, the Partitioner has this method:
int getPartition(K2 key, V2 value, int numPartitions)
My first idea was to have the Partitioner and the Mapper collaborate as follows: the Mapper outputs the record once per reduce task, and the Partitioner returns each int from 0 to numPartitions-1 in turn, so the record is sure to reach every partition.
Are there any other, smarter ways of solving this? For instance, I could return -1 for the records that need to go to all partitions, and the framework would do the fan-out for me when it sees the -1.
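The partitioner doesn't work that way. Its job is to look at the key (usually) and the value (rarely) to determine which reducer the pair should be sent to. This happens after the mapper and before the reducer. It returns exactly one partition index per call, so a single emitted pair can only ever reach one reducer, and there is no special return value such as -1 that means "all of them".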
Instead, you (the mapper) should be able to ask the context for the job configuration, which can tell you the total number of reducers (partitions). Your mapper can then output a composite key comprising the actual key you want and a partition number, writing the record out once per reducer. All the partitioner has to do is break down the composite key, extract the target reducer index and return that index.
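To make the idea concrete, here is a minimal sketch of the logic in plain Java, with no Hadoop classes involved. Every name in it (BroadcastSketch, CompositeKey, broadcast, single) is made up for illustration, and the reducer count is hard-coded where a real mapper would read it from the job configuration. In an actual job the composite key would also have to be a Hadoop key type, and getPartition would live in your Partitioner implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hadoop-free sketch of the composite-key idea; all names are hypothetical.
final class BroadcastSketch {

    // Composite key: the reducer this pair is meant for, plus the real key.
    static final class CompositeKey {
        final int targetPartition;
        final String realKey;

        CompositeKey(int targetPartition, String realKey) {
            this.targetPartition = targetPartition;
            this.realKey = realKey;
        }
    }

    // Mapper side, broadcast record: one copy per reducer, each tagged with its target.
    static List<CompositeKey> broadcast(String realKey, int numReducers) {
        List<CompositeKey> out = new ArrayList<>();
        for (int p = 0; p < numReducers; p++) {
            out.add(new CompositeKey(p, realKey));
        }
        return out;
    }

    // Mapper side, ordinary record: a single copy, target chosen by hashing the real key.
    static CompositeKey single(String realKey, int numReducers) {
        // Mask the sign bit so the remainder is never negative.
        int p = (realKey.hashCode() & Integer.MAX_VALUE) % numReducers;
        return new CompositeKey(p, realKey);
    }

    // Partitioner side: unpack the composite key and return the stored index.
    static int getPartition(CompositeKey key, int numPartitions) {
        return key.targetPartition % numPartitions;
    }

    public static void main(String[] args) {
        int numReducers = 3; // assumption: a real job would take this from its configuration
        for (CompositeKey k : broadcast("total-count", numReducers)) {
            System.out.println(k.realKey + " -> reducer " + getPartition(k, numReducers));
        }
    }
}
```

Note that the partitioner itself stays trivial here: all the routing decisions are made in the mapper, and the partitioner only reads back the index that the mapper stored in the key.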
By the way, this means that if you're using this technique to send out counts (if you're sorting) or other metadata to be used later in the processing, then your real data keys have to follow the same composite format. In fact, you'll probably have to include in the composite key an indicator describing the kind of key/value pair it is (e.g. 1 = real data, 0 = processing metadata).
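One possible shape for such a key, again as a Hadoop-free sketch: the class name TaggedKey and its fields are hypothetical, and the ordering (partition first, then kind, then the real key) is my own assumption about what is useful, chosen so that the metadata pairs for a partition sort ahead of its real data. A real Hadoop key would additionally need to be serializable by the framework.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical composite key carrying a kind indicator (0 = metadata, 1 = real data).
final class TaggedKey implements Comparable<TaggedKey> {
    static final int METADATA = 0;
    static final int REAL_DATA = 1;

    final int partition; // target reducer index
    final int kind;      // METADATA or REAL_DATA
    final String key;    // the actual key

    TaggedKey(int partition, int kind, String key) {
        this.partition = partition;
        this.kind = kind;
        this.key = key;
    }

    // Order by partition, then kind, then real key, so that within one
    // partition the metadata (kind 0) comes before the real data (kind 1).
    @Override
    public int compareTo(TaggedKey other) {
        int c = Integer.compare(partition, other.partition);
        if (c != 0) return c;
        c = Integer.compare(kind, other.kind);
        if (c != 0) return c;
        return key.compareTo(other.key);
    }

    @Override
    public String toString() {
        return partition + ":" + kind + ":" + key;
    }

    public static void main(String[] args) {
        List<TaggedKey> keys = new ArrayList<>();
        keys.add(new TaggedKey(0, REAL_DATA, "banana"));
        keys.add(new TaggedKey(0, METADATA, "count"));
        keys.add(new TaggedKey(0, REAL_DATA, "apple"));
        Collections.sort(keys);
        System.out.println(keys);
    }
}
```

With an ordering like this, a reducer would see its metadata pairs before any real data and could check the kind field to decide how to handle each pair.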