 

How does Spark's RDD.randomSplit actually split the RDD?


Assume I've got an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is partitioned across 100 partitions.

When calling RDD.randomSplit(0.8, 0.2):

Does the function also shuffle the RDD? Or does the splitting simply take a contiguous 20% block of the RDD? Or does it select 20% of the partitions at random?

Ideally, the resulting splits would have the same class distribution as the original RDD (i.e. 2:1). Is that the case?

Thanks

Madzor asked Oct 04 '15 11:10





1 Answer

For each range defined by the weights array, there is a separate mapPartitionsWithIndex transformation, which preserves partitioning.

Each partition is sampled using a set of BernoulliCellSamplers. For each split, the sampler iterates over the elements of the partition and selects an item if the next random Double falls in the range defined by the normalized weights. All samplers for a given partition use the same RNG seed. This means that randomSplit:

  • doesn't shuffle the RDD
  • doesn't take contiguous blocks, other than by chance
  • takes a random sample from each partition
  • takes non-overlapping samples
  • requires one pass over the data per split
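
The behaviour described above can be imitated without Spark. The following is only a rough pure-Python sketch of the idea, not Spark's actual code: the names split_partition and random_split are hypothetical, Python's random.Random stands in for Spark's RNG, and seeding each partition with seed + index is an assumption made for illustration. The data mirrors the question: 3000 rows, the first 2000 of class 1, spread over 100 partitions.

```python
import random


def split_partition(partition, index, bounds, seed):
    """Keep the elements of one partition whose random draw falls in [lb, ub)."""
    lb, ub = bounds
    # Assumption for illustration: every split of the same partition reuses
    # the same seed, so each element gets the same draw in every split.
    rng = random.Random(seed + index)
    return [x for x in partition if lb <= rng.random() < ub]


def random_split(partitions, weights, seed):
    """Return one list of sampled partitions per weight."""
    total = sum(weights)
    cum = [0.0]
    for w in weights:
        cum.append(cum[-1] + w / total)  # normalized cumulative weights
    # One full pass over all partitions for each split; no data moves
    # between partitions and element order is untouched.
    return [
        [split_partition(p, i, (lb, ub), seed) for i, p in enumerate(partitions)]
        for lb, ub in zip(cum, cum[1:])
    ]


# 3000 rows as (row_id, class): first 2000 are class 1, last 1000 are class 2.
rows = [(i, 1 if i < 2000 else 2) for i in range(3000)]
partitions = [rows[i:i + 30] for i in range(0, 3000, 30)]  # 100 partitions

train, test = random_split(partitions, [0.8, 0.2], seed=42)
train_rows = [r for p in train for r in p]
test_rows = [r for p in test for r in p]

for name, part in (("train", train_rows), ("test", test_rows)):
    class1 = sum(1 for _, c in part if c == 1)
    print(name, len(part), "rows,", class1, "of class 1")
```

Because each row is kept or dropped independently of its class, the 2:1 class ratio should carry over to both splits only approximately (in expectation). This is not stratified sampling, so the exact counts will vary with the seed.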
zero323 answered Sep 20 '22 14:09
