Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why would I want .union over .unionAll in Spark for SchemaRDDs?

I'm trying to wrap my head around these two functions in the Spark SQL documentation–

  • def union(other: RDD[Row]): RDD[Row]

    Return the union of this RDD and another one.

  • def unionAll(otherPlan: SchemaRDD): SchemaRDD

    Combines the tuples of two RDDs with the same schema, keeping duplicates.

This is not the standard behavior of UNION vs UNION ALL, as documented in this SO question.

My code here, borrowing from the Spark SQL documentation, has the two functions returning the same results.

scala> case class Person(name: String, age: Int)
scala> import org.apache.spark.sql._
scala> val one = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2)))
scala> val two = sc.parallelize(Array(Person("Alpha",1), Person("Beta",2),  Person("Gamma", 3)))
scala> val schemaString = "name age"
scala> val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
scala> val peopleSchemaRDD1 = sqlContext.applySchema(one, schema)
scala> val peopleSchemaRDD2 = sqlContext.applySchema(two, schema)
scala> peopleSchemaRDD1.union(peopleSchemaRDD2).collect
res34: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])
scala> peopleSchemaRDD1.unionAll(peopleSchemaRDD2).collect
res35: Array[org.apache.spark.sql.Row] = Array([Alpha,1], [Beta,2], [Alpha,1], [Beta,2], [Gamma,3])

Why would I prefer one over the other?

like image 782
duber Avatar asked Mar 12 '15 23:03

duber


People also ask

What is difference between union and union all in spark?

UNION and UNION ALL return the rows that are found in either relation. UNION (alternatively, UNION DISTINCT ) takes only distinct rows while UNION ALL does not remove duplicates from the result rows.

Is unionAll deprecated?

The DataFrame unionAll() function or the method of the data frame is widely used and is deprecated since the Spark “2.0. 0” version and is further replaced with union().

What is Union in spark?

The Union is a transformation in Spark that is used to work with multiple data frames in Spark. It takes the data frame as the input and the return type is a new data frame containing the elements that are in data frame1 as well as in data frame2.

Is Union all deprecated?

DataFrame unionAll() – unionAll() is deprecated since Spark “2.0. 0” version and replaced with union(). Note: In other SQL's, Union eliminates the duplicates but UnionAll combines two datasets including duplicate records. But, in spark both behave the same and use DataFrame duplicate function to remove duplicate rows.


2 Answers

In Spark 1.6, the above version of union was removed, so unionAll was all that remained.

In Spark 2.0, unionAll was renamed to union, with unionAll kept in for backward compatibility (I guess).

In any case, no deduplication is done in either union (Spark 2.0) or unionAll (Spark 1.6).

like image 162
Kris Avatar answered Sep 20 '22 08:09

Kris


unionAll() was deprecated in Spark 2.0, and for all future reference, union() is the only recommended method.

In either case, union or unionAll, both do not do a SQL style deduplication of data. In order to remove any duplicate rows, just use union() followed by a distinct().

like image 22
Keshav Potluri Avatar answered Sep 22 '22 08:09

Keshav Potluri