Can anyone suggest typical scenarios where the Partitioner class introduced in .NET 4.0 can/should be used?
In Hadoop MapReduce, the Partitioner controls the partitioning of the keys of the intermediate map outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function; Hadoop's default HashPartitioner uses the key's hash code modulo the number of reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.
Before the reduce phase, the map output is partitioned on the basis of the key. Partitioning guarantees that all the values for a single key go to the same reducer, and a good partitioning function spreads the map output evenly over the reducers. Each mapper's output is partitioned by key, so records with the same key end up in the same partition (within each mapper), and each partition is then sent to a reducer.
A custom Partitioner can be written by overriding the getPartition method, which takes the key, the value, and the number of partitions, and returns the partition number for that record. In the reducer, we then only need to collect the <key,value> pairs routed to it by the custom Partitioner and write the logic for that group, for example finding the highest age in each flight and printing out the result.
The Partitioner class is used to make parallel execution chunkier. If you have a lot of very small work items to run in parallel, the overhead of invoking a delegate for each one may be prohibitive. By using Partitioner, you can rearrange the workload into chunks and have each parallel invocation work on a slightly larger set. The class abstracts this concern and is able to partition based on the actual conditions of the dataset and the available cores.
Example: Imagine you want to run a simple calculation like this in parallel.
Parallel.ForEach(Input, (value, loopState, index) =>
{
    Result[index] = value * Math.PI;
});
That would invoke the delegate once for each entry in Input, which adds a bit of overhead to every element. By using Partitioner we can instead do something like this:
Parallel.ForEach(Partitioner.Create(0, Input.Length), range =>
{
    for (var index = range.Item1; index < range.Item2; index++)
    {
        Result[index] = Input[index] * Math.PI;
    }
});
This reduces the number of delegate invocations, since each invocation now works on an entire range of elements. In my experience this can boost performance significantly when parallelizing very simple operations.
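For completeness, here is a minimal, self-contained sketch of the example above. The Input and Result arrays, their size, and the console output are assumptions made purely for illustration; only the two Parallel.ForEach calls come from the answer itself.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PartitionerDemo
{
    static void Main()
    {
        // Hypothetical data: one million values to scale by Math.PI.
        var Input = new double[1000000];
        for (var i = 0; i < Input.Length; i++) Input[i] = i;
        var Result = new double[Input.Length];

        // Per-element version: the delegate is invoked once per entry,
        // so the invocation overhead dominates such a tiny loop body.
        Parallel.ForEach(Input, (value, loopState, index) =>
        {
            Result[index] = value * Math.PI;
        });

        // Chunked version: Partitioner.Create(0, Input.Length) yields
        // index ranges as Tuple<int, int>, so each delegate invocation
        // processes a whole range with a plain inner loop.
        Parallel.ForEach(Partitioner.Create(0, Input.Length), range =>
        {
            for (var index = range.Item1; index < range.Item2; index++)
            {
                Result[index] = Input[index] * Math.PI;
            }
        });

        Console.WriteLine(Result[1]); // prints 3.14159...
    }
}

Partitioner.Create(0, Input.Length) lets the runtime pick the range size; there is also an overload that takes an explicit range size if you want to control the chunking yourself.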