Checking for equality of RDDs

Tags:

java

junit

equals

apache-spark

I am writing some tests in JUnit and I need to check the equality of two Spark RDDs.

A way I thought of doing it is this:

JavaRDD<SomeClass> expResult = ...;
JavaRDD<SomeClass> result = ...;

assertEquals(expResult.collect(), result.collect());

Is there a better way than this?

515

asked Nov 30 '14 13:11

Aki K

1 Answer

If the expected result is reasonably small, it's best to collect the RDD data and compare it locally (just as you've written).
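One caveat with the local comparison: `List.equals` is order-sensitive, so `assertEquals` on two collected lists fails when the same elements come back in a different order (for example, after a shuffle). Below is a minimal sketch of an order-insensitive check, assuming the element class implements `equals()` and `hashCode()` consistently. The class and method names (`CollectedCompare`, `counts`, `sameElements`) are hypothetical helpers, not part of Spark or JUnit.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for tests; not part of Spark or JUnit.
final class CollectedCompare {

    // Counts how many times each element occurs in the list.
    static <T> Map<T, Integer> counts(List<T> items) {
        Map<T, Integer> result = new HashMap<>();
        for (T item : items) {
            result.merge(item, 1, Integer::sum);
        }
        return result;
    }

    // True when both lists hold the same elements with the same
    // multiplicities, regardless of order.
    static <T> boolean sameElements(List<T> expected, List<T> actual) {
        return counts(expected).equals(counts(actual));
    }

    public static void main(String[] args) {
        List<Integer> expected = List.of(1, 2, 2, 3);
        List<Integer> shuffled = List.of(2, 3, 1, 2);
        System.out.println(expected.equals(shuffled));        // false
        System.out.println(sameElements(expected, shuffled)); // true
    }
}
```

In a JUnit test this could be used as `assertTrue(CollectedCompare.sameElements(expResult.collect(), result.collect()))`, since `JavaRDD.collect()` returns a `java.util.List`.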

When the tests need larger datasets, there are a few other possibilities:

Disclaimer: I'm not familiar enough with the Spark Java API, so the sample code below is in Scala. I hope that won't be a problem, since it can either be rewritten in Java or wrapped in a couple of utility functions invoked from Java code.

Method 1: Zip RDDs together and compare item-by-item

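This method is only usable if the order of elements in both RDDs is well defined (e.g., the RDDs are sorted). Note that zip also requires both RDDs to have the same number of partitions and the same number of elements in each partition.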

val diff = expResult
  .zip(result)
  .collect { case (a, b) if a != b => a -> b }
  .take(100)

The diff array will contain up to 100 differing pairs. If the RDDs are big and you'd like to obtain all items of the diff locally, drop the take(100) call and use the toLocalIterator method instead. It's better not to use the parameterless collect method for this, since you may run out of memory.

This method is probably the fastest, since it doesn't require a shuffle, but it can only be used if the order of partitions in the RDDs and the order of items within each partition are well defined.
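For readers who prefer Java, the position-by-position comparison of Method 1 can be illustrated without Spark on two plain lists. This is only a local stand-in for the zip-based approach, not a translation of the Scala snippet to the Spark Java API; `PairwiseDiff` and `firstDifferences` are hypothetical names. The size check mirrors the requirement that both sides hold the same number of elements.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical local stand-in for the zip-and-compare idea; no Spark involved.
final class PairwiseDiff {

    // Returns up to `limit` (expected, actual) pairs that differ
    // at the same position in the two lists.
    static <T> List<Map.Entry<T, T>> firstDifferences(
            List<T> expected, List<T> actual, int limit) {
        if (expected.size() != actual.size()) {
            throw new IllegalArgumentException("lists must have the same size");
        }
        List<Map.Entry<T, T>> diff = new ArrayList<>();
        for (int i = 0; i < expected.size() && diff.size() < limit; i++) {
            T a = expected.get(i);
            T b = actual.get(i);
            if (!Objects.equals(a, b)) {
                diff.add(new SimpleImmutableEntry<>(a, b));
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        List<String> expected = List.of("a", "b", "c");
        List<String> actual = List.of("a", "x", "c");
        // Only position 1 differs, so one pair is reported.
        System.out.println(firstDifferences(expected, actual, 100)); // [b=x]
    }
}
```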

Method 2: Co-group RDDs

This method can be used to test whether the result RDD contains the specified (possibly non-unique) values, in no particular order:

val diff = expResult.map(_ -> 1)
  .cogroup(result.map(_ -> 1))
  .collect { case (a, (i1, i2)) if i1.sum != i2.sum => a -> (i1.sum - i2.sum) }
  .take(100)

The diff array will contain the differing values together with the difference between their counts in the two RDDs.

For example:

If expResult contains a single instance of some value and result doesn't contain that value, the number will be +1.
If result contains 3 instances of another value and expResult only 1, the number will be -2.

This method will be faster than other options (e.g., subtracting the RDDs from each other), since it requires only one shuffle.
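The sign convention of Method 2 (count in expResult minus count in result) can be checked locally with a small plain-Java sketch. It mirrors the arithmetic of the cogroup snippet on ordinary lists and does not use Spark; `CountDiff` and `countDifferences` are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local illustration of the count-difference logic; no Spark involved.
final class CountDiff {

    // For each value: (occurrences in expected) - (occurrences in actual).
    // Values whose counts match are dropped from the result.
    static <T> Map<T, Integer> countDifferences(List<T> expected, List<T> actual) {
        Map<T, Integer> diff = new HashMap<>();
        for (T item : expected) {
            diff.merge(item, 1, Integer::sum);
        }
        for (T item : actual) {
            diff.merge(item, -1, Integer::sum);
        }
        diff.values().removeIf(n -> n == 0);
        return diff;
    }

    public static void main(String[] args) {
        // "x" appears once in expected and never in actual;
        // "y" appears once in expected and three times in actual.
        List<String> expected = List.of("x", "y");
        List<String> actual = List.of("y", "y", "y");
        Map<String, Integer> diff = countDifferences(expected, actual);
        System.out.println(diff.get("x")); // 1
        System.out.println(diff.get("y")); // -2
    }
}
```

An empty map means both lists hold the same values with the same multiplicities.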

119

answered Sep 30 '22 07:09

Wildfire
