I'm trying to wrap my head around these two functions in the Spark SQL documentation:
def union(other: RDD[Row]): RDD[Row]
Return the union of this RDD and another one.
def unionAll(otherPlan: SchemaRDD): SchemaRDD
Combines the tuples of two RDDs with the same schema, keeping duplicates.
This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.
My code here, borrowing from the Spark SQL documentation, has the two functions returning the same results.
scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2), Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
Why would I prefer one over the other?
UNION and UNION ALL both return the rows found in either relation. UNION (equivalently, UNION DISTINCT) keeps only distinct rows, while UNION ALL does not remove duplicates from the result rows.
The DataFrame unionAll() method is widely used, but it has been deprecated since Spark 2.0.0 and replaced with union(). union is a transformation that works with multiple DataFrames: it takes a DataFrame as input and returns a new DataFrame containing the rows that are in DataFrame1 as well as those in DataFrame2. Note: in other SQL dialects, UNION eliminates duplicates while UNION ALL combines two datasets including duplicate records. In Spark, however, both methods behave the same, and you use the DataFrame distinct() function to remove duplicate rows.
In Spark 1.6, the above version of union was removed, so unionAll was all that remained. In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess). In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).
unionAll() was deprecated in Spark 2.0, and going forward union() is the only recommended method. In either case, neither union nor unionAll performs SQL-style deduplication of the data. To remove any duplicate rows, just use union() followed by distinct().
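To make the two behaviors concrete, here is a plain-Scala sketch (ordinary collections, not Spark code) using the question's sample data. The ++ concatenation stands in for Spark's union()/unionAll() (duplicates kept), and .distinct stands in for the union() followed by distinct() pattern the answers recommend.

```scala
// Hypothetical plain-Scala analogy to the Spark semantics described above.
case class Person(name: String, age: Int)

val one = Seq(Person("Alpha", 1), Person("Beta", 2))
val two = Seq(Person("Alpha", 1), Person("Beta", 2), Person("Gamma", 3))

// Like Spark's union()/unionAll(): plain concatenation, duplicates kept.
val unionAll = one ++ two                  // Alpha and Beta appear twice

// Like union() followed by distinct(): SQL UNION semantics.
val unionDistinct = (one ++ two).distinct  // each person appears once

println(unionAll.size)       // 5
println(unionDistinct.size)  // 3
```

Because Person is a case class, equality is structural, so .distinct correctly collapses the duplicated rows, just as DataFrame.distinct() does for rows with identical column values.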