 

How to zip two (or more) DataFrame in Spark

I have two DataFrames, a and b. a looks like:

Column 1 | Column 2
abc      |  123
cde      |  23 

b looks like:

Column 1 
1      
2      

I want to zip a and b (or even more DataFrames) into something like:

Column 1 | Column 2 | Column 3
abc      |  123     |   1
cde      |  23      |   2

How can I do it?

asked Oct 01 '15 by worldterminator



2 Answers

An operation like this is not supported by the DataFrame API. It is possible to zip two RDDs, but to make it work you have to match both the number of partitions and the number of elements per partition. Assuming this is the case:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}

val a: DataFrame = sc.parallelize(Seq(
  ("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")

// Merge rows
val rows = a.rdd.zip(b.rdd).map{
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}

// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)

// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
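
With the toy data above, ab should come out as the table asked for in the question; a quick check (output shape assumed from the usual show() rendering):

ab.show()
// +--------+--------+--------+
// |column_1|column_2|column_3|
// +--------+--------+--------+
// |     abc|     123|       1|
// |     cde|      23|       2|
// +--------+--------+--------+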

If the above conditions are not met, the only option that comes to mind is adding an index and joining:

def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)

// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)

// Join and clean
val ab = aWithIndex
  .join(bWithIndex, Seq("_index"))
  .drop("_index")

answered Sep 18 '22 by zero323


In Scala's implementation of DataFrames, there is no simple way to concatenate two DataFrames into one. We can work around this limitation by adding an index column to each row of the DataFrames and then doing an inner join on these indices. This is my stub code for this implementation:

// monotonicallyIncreasingId lives in org.apache.spark.sql.functions
import org.apache.spark.sql.functions.monotonicallyIncreasingId

val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val aWithId: DataFrame = a.withColumn("id", monotonicallyIncreasingId)

val b: DataFrame = sc.parallelize(Seq(1, 2)).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonicallyIncreasingId)

aWithId.join(bWithId, "id")
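
To get exactly the shape asked for, drop the helper column after the join. Note that the ids generated by monotonicallyIncreasingId only line up when both DataFrames have the same partitioning, the same caveat as zipping RDDs in the first answer; that holds for the toy data here:

aWithId.join(bWithId, "id").drop("id").show()
// Yields the same three-column table as in the first answer.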

A little light reading - Check out how Python does this!

answered Sep 20 '22 by Sohum Sachdev