Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to use the RangePartitioner in Spark

I want to use a RangePartitioner in my Java Spark Application, but I have no clue how to set the two scala parameters scala.math.Ordering<K> evidence$1 and scala.reflect.ClassTag<K> evidence$2. Can someone give me an example?

Here is the link to the JavaDoc of RangePartitioner (it was no help for me because I'm new to Spark and Scala...):

My Code actually looks like:

JavaPairRDD<Integer, String> partitionedRDD = rdd.partitionBy(new RangePartitioner<Integer, String>(10, rdd, true, evidence$1, evidence$2));
like image 954
D. Müller Avatar asked Jun 09 '15 09:06

D. Müller


People also ask

How do I use range partition in spark?

Range Partitioning in Apache Spark Through this method, tuples those have keys within the same range will appear on the same machine. In range partitioner, keys are partitioned based on an ordering of keys. Also, depends on the set of sorted range of keys.

How do I control the number of partitions in spark?

repartition() can be used for increasing or decreasing the number of partitions of a Spark DataFrame. However, repartition() involves shuffling which is a costly operation.

What is range partition spark?

A Partitioner that partitions sortable records by range into roughly equal ranges. The ranges are determined by sampling the content of the RDD passed in.


1 Answers

You can create both the Ordering and the ClassTag by calling methods on their companion objects.

These are referred to in java like this: ClassName$.MODULE$.functionName()

One further wrinkle is that the constructor requires a scala RDD, not a java one. You can get the scala RDD from a java PairRDD by calling rdd.rdd()

    final Ordering<Integer> ordering = Ordering$.MODULE$.comparatorToOrdering(Comparator.<Integer>naturalOrder());
    final ClassTag<Integer> classTag = ClassTag$.MODULE$.apply(Integer.class);
    final RangePartitioner<Integer, String> partitioner = new RangePartitioner<>(
            10, 
            rdd.rdd(),   //note the call to rdd.rdd() here, this gets the scala RDD backing the java one
            true,
            ordering,
            classTag);
    final JavaPairRDD<Integer, String> partitioned = rdd.partitionBy(partitioner);
like image 128
whaleberg Avatar answered Sep 28 '22 00:09

whaleberg