I have the following DataFrame:
|-----id-------|----value------|-----desc------|
| 1 | v1 | d1 |
| 1 | v2 | d2 |
| 2 | v21 | d21 |
| 2 | v22 | d22 |
|--------------|---------------|---------------|
I want to transform it into:
|-----id-------|----value------|-----desc------|
| 1 | v1;v2 | d1;d2 |
| 2 | v21;v22 | d21;d22 |
|--------------|---------------|---------------|
I presume rdd.reduce is the key, but I have no idea how to adapt it to this scenario.
Using the concat() or concat_ws() Spark SQL functions you can concatenate one or more DataFrame columns into a single column; the answers below show how to do this with the DataFrame API, with raw SQL, and with custom aggregation in Scala. The aggregation itself goes through agg, which accepts either a sequence of aggregate Column expressions or (Scala-specific) a Map from column name to aggregate method. agg lives on RelationalGroupedDataset, the class returned by Dataset.groupBy; besides agg it also offers convenience methods for common statistics such as mean and sum.
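For example, a minimal DataFrame-API sketch that produces the requested output, assuming a DataFrame df with the id, value and desc columns from the question (collect_list is exposed in org.apache.spark.sql.functions as of Spark 1.6; on 1.x it needs a HiveContext, on 2.x it works out of the box):

import org.apache.spark.sql.functions.{collect_list, concat_ws}

// collect all values/descs per id, then join each collected list with ";"
val grouped = df.groupBy("id").agg(
  concat_ws(";", collect_list("value")).as("value"),
  concat_ws(";", collect_list("desc")).as("desc")
)
grouped.show()

Note that collect_list does not guarantee the order of the collected elements within a group.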
You can transform your data using Spark SQL:

case class Test(id: Int, value: String, desc: String)

import sqlContext.implicits._  // needed for toDF()

val data = sc.parallelize(Seq((1, "v1", "d1"), (1, "v2", "d2"), (2, "v21", "d21"), (2, "v22", "d22")))
  .map(line => Test(line._1, line._2, line._3))
  .toDF()

data.registerTempTable("data")

val result = sqlContext.sql("select id, concat_ws(';', collect_list(value)) as value, concat_ws(';', collect_list(desc)) as desc from data group by id")
result.show
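On Spark 2.x the same query runs through a SparkSession, and registerTempTable has been superseded by createOrReplaceTempView. A sketch, assuming a SparkSession named spark:

data.createOrReplaceTempView("data")
val result = spark.sql("select id, concat_ws(';', collect_list(value)) as value, concat_ws(';', collect_list(desc)) as desc from data group by id")
result.show()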
Suppose you have something like
import scala.util.Random
val sqlc: SQLContext = ???
case class Record(id: Long, value: String, desc: String)
val testData = for {
  (i, j) <- List.fill(30)((Random.nextInt(5), Random.nextInt(5)))
} yield Record(i, s"v$i$j", s"d$i$j")
val df = sqlc.createDataFrame(testData)
You can then aggregate the data per id as follows:

import sqlc.implicits._

// collect all entries of the given column per id into a Vector
def aggConcat(col: String) = df
  .map(row => (row.getAs[Long]("id"), row.getAs[String](col)))
  .aggregateByKey(Vector[String]())(_ :+ _, _ ++ _)

val result = aggConcat("value").zip(aggConcat("desc")).map {
  case ((id, value), (_, desc)) => (id, value, desc)
}.toDF("id", "values", "descs")
If you would like concatenated strings instead of arrays, you can then run:
import org.apache.spark.sql.functions._
val resultConcat = result
.withColumn("values", concat_ws(";", $"values"))
.withColumn("descs" , concat_ws(";", $"descs" ))
If you are working with DataFrames and want a reusable aggregation, you can define a UDAF (UserDefinedAggregateFunction):
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, StringType, StructField, StructType}
class ConcatStringsUDAF(inputColumnName: String, sep: String = ",") extends UserDefinedAggregateFunction {

  def inputSchema: StructType = StructType(StructField(inputColumnName, StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("concatString", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = ""

  // join two partial results, skipping nulls and empty strings
  private def concatStrings(str1: String, str2: String): String = (str1, str2) match {
    case (s1: String, s2: String) => Seq(s1, s2).filter(_ != "").mkString(sep)
    case (null, s: String) => s
    case (s: String, null) => s
    case _ => ""
  }

  // fold one input row into the aggregation buffer
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val acc1 = buffer.getAs[String](0)
    val acc2 = input.getAs[String](0)
    buffer(0) = concatStrings(acc1, acc2)
  }

  // combine two partial buffers
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    val acc1 = buffer1.getAs[String](0)
    val acc2 = buffer2.getAs[String](0)
    buffer1(0) = concatStrings(acc1, acc2)
  }

  def evaluate(buffer: Row): Any = buffer.getAs[String](0)
}
And then use it like this, applied to a DataFrame df with the question's id, value and desc columns:

val concatValues = new ConcatStringsUDAF("value", ";")
val concatDescs = new ConcatStringsUDAF("desc", ";")

df.groupBy("id").agg(concatValues(df("value")).as("values"), concatDescs(df("desc")).as("descs"))
As of Spark 1.6, also have a look at Datasets and the Aggregator API.
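A minimal Aggregator sketch under that API, assuming Spark 2.x Dataset syntax, a SparkSession named spark and the Record case class from above (the ConcatValues object is illustrative, not part of any of the answers):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import spark.implicits._

// concatenates the value field of all Records in a group with ";"
object ConcatValues extends Aggregator[Record, String, String] {
  def zero: String = ""
  def reduce(acc: String, r: Record): String = if (acc.isEmpty) r.value else acc + ";" + r.value
  def merge(a: String, b: String): String = Seq(a, b).filter(_.nonEmpty).mkString(";")
  def finish(acc: String): String = acc
  def bufferEncoder: Encoder[String] = Encoders.STRING
  def outputEncoder: Encoder[String] = Encoders.STRING
}

val ds = df.as[Record]
ds.groupByKey(_.id).agg(ConcatValues.toColumn.name("values")).show()

A second Aggregator (or one that returns a tuple of both concatenations) handles the desc column in the same way.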