Suppose I have an RDD with 3000 rows. The first 2000 rows are of class 1 and the last 1000 rows are of class 2. The RDD is spread across 100 partitions. When calling <code>RDD.randomSplit(0.8,0.2)</code>, does the function also shuffle the RDD? Or does the split simply take a contiguous 20% of the RDD? Or does it select 20% of the partitions at random? Ideally, the resulting splits would have the same class distribution as the original RDD (i.e. 2:1). Is that the case? Thanks

For each range defined by the <code>weights</code> array there is a separate <code>mapPartitionsWithIndex</code> transformation, which preserves partitioning. Each partition is sampled using a set of <code>BernoulliCellSampler</code>s. For each split, the sampler iterates over the elements of a given partition and selects an item if the next random <code>Double</code> falls within that split's range, as defined by the normalized weights. All samplers for a given partition use the same RNG seed. This means that <code>randomSplit</code>: <ul> <li>doesn't shuffle the RDD</li> <li>doesn't take contiguous blocks other than by chance</li> <li>takes a random sample from each partition</li> <li>takes non-overlapping samples</li> <li>requires as many passes over the data as there are splits</li> </ul>
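The mechanism described above can be imitated without Spark. The following is a minimal pure-Python sketch, not Spark's actual code: the function name `random_split`, the use of `seed + index` as the per-partition seed, and the 30-rows-per-partition layout are assumptions made for illustration. It shows one pass per split, one RNG per partition re-created with the same seed for every split, and selection by checking whether the next random number falls into the split's range of the normalized weights.

```python
import random
from itertools import accumulate


def random_split(partitions, weights, seed):
    """Simulate range-based per-partition sampling (illustrative, not Spark's code)."""
    total = sum(weights)
    # Normalized cumulative bounds, e.g. [0.8, 0.2] -> [0.0, 0.8, 1.0]
    bounds = [0.0] + [c / total for c in accumulate(weights)]
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    splits = []
    for lo, hi in zip(bounds, bounds[1:]):
        # One full pass over every partition for each split
        split = []
        for index, partition in enumerate(partitions):
            # Assumption: the seed depends only on the partition, so every
            # split sees the same random sequence for that partition.
            rng = random.Random(seed + index)
            split.append([row for row in partition if lo <= rng.random() < hi])
        splits.append(split)
    return splits


# 3000 rows: the first 2000 are class 1, the last 1000 are class 2
rows = [(i, 1 if i < 2000 else 2) for i in range(3000)]
# 100 partitions of 30 consecutive rows each (assumed layout)
partitions = [rows[i:i + 30] for i in range(0, 3000, 30)]

train, test = random_split(partitions, [0.8, 0.2], seed=42)
train_rows = [row for part in train for row in part]
test_rows = [row for part in test for row in part]

for name, split_rows in (("train", train_rows), ("test", test_rows)):
    class_1 = sum(1 for _, cls in split_rows if cls == 1)
    print(name, len(split_rows), "rows,", class_1, "of class 1")
```

Because every row is accepted or rejected independently of its class, each split should keep roughly the original 2:1 ratio, but only in expectation. Neither the split sizes nor the class proportions are exact, so a stratified approach would be needed if an exact ratio matters.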

How does Spark's RDD.randomSplit actually split the RDD?


194

answered Sep 20 '22 14:09

zero323


Tags:

Madzor
