I have looked at a number of questions online, but they don't seem to do what I'm trying to achieve.
I'm using Apache Spark 2.0.2 with Scala.
I have a dataframe:
+----------+-----+----+----+----+----+----+
|segment_id| val1|val2|val3|val4|val5|val6|
+----------+-----+----+----+----+----+----+
| 1| 100| 0| 0| 0| 0| 0|
| 2| 0| 50| 0| 0| 20| 0|
| 3| 0| 0| 0| 0| 0| 0|
| 4| 0| 0| 0| 0| 0| 0|
+----------+-----+----+----+----+----+----+
which I want to transpose to
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+-----+----+----+----+
I've tried using pivot(), but I couldn't get to the right answer. I ended up looping through my val{x} columns and pivoting each as per below, but this is proving to be very slow.
val d = df.select('segment_id, 'val1)
+----------+-----+
|segment_id| val1|
+----------+-----+
| 1| 100|
| 2| 0|
| 3| 0|
| 4| 0|
+----------+-----+
d.groupBy('val1).sum().withColumnRenamed("val1", "vals")
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val1| 100| 0| 0| 0|
+----+-----+----+----+----+
Then, on each val{x} iteration, I union() the result onto my first dataframe (the whole loop is sketched below), e.g. the val2 piece:
+----+-----+----+----+----+
|vals| 1| 2| 3| 4|
+----+-----+----+----+----+
|val2| 0| 50| 0| 0|
+----+-----+----+----+----+
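For reference, the whole loop looks roughly like this. It is a simplified sketch of what I'm doing rather than my exact code, and the intermediate column names ("vals", "v") are just placeholders I've picked for this example:

import org.apache.spark.sql.functions.{col, lit}

// Simplified sketch of my current approach: pivot each val{x} column on its own,
// then union all the single-row results together. Correct output, but very slow.
val pieces = df.columns.tail.map { c =>
  df.select(lit(c).alias("vals"), col("segment_id"), col(c).alias("v"))
    .groupBy("vals")
    .pivot("segment_id")
    .sum("v")
}
val result = pieces.reduce(_ union _)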
Is there a more efficient way to do this transpose, given that I do not want to aggregate the data?
Thanks :)
Unfortunately there is no case when transposing a DataFrame is justified, considering the amount of data. You have to remember that a DataFrame, as implemented in Spark, is a distributed collection of rows, and each row is stored and processed on a single node.
You could express a transposition on a DataFrame as a pivot:
import org.apache.spark.sql.functions.{array, col, explode, first, lit, struct}

// Turn each val{x} column into a (column-name, value) struct and explode into rows.
val kv = explode(array(df.columns.tail.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))

df
  .withColumn("kv", kv)
  .select($"segment_id", $"kv.k", $"kv.v")
  .groupBy($"k")
  .pivot("segment_id")            // one output column per segment_id
  .agg(first($"v"))               // each (k, segment_id) pair holds exactly one value, so first is safe
  .orderBy($"k")
  .withColumnRenamed("k", "vals")
but this is merely toy code with no practical application. In practice it is no better than collecting the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Collect to the driver and transpose locally: the first transposed row holds
// the segment_ids (the new column names), the rest hold the val{x} values.
val (header, data) = df.collect.map(_.toSeq.toArray).transpose match {
  case Array(h, t @ _*) =>
    (h.map(_.toString), t.map(_.collect { case x: Int => x }))
}

// Each original val{x} column becomes one row, prefixed with its name.
val rows = df.columns.tail.zip(data).map { case (x, ys) => Row.fromSeq(x +: ys) }
val schema = StructType(
  StructField("vals", StringType) +: header.map(StructField(_, IntegerType))
)
spark.createDataFrame(sc.parallelize(rows), schema)
For a DataFrame defined as:
val df = Seq(
(1, 100, 0, 0, 0, 0, 0),
(2, 0, 50, 0, 0, 20, 0),
(3, 0, 0, 0, 0, 0, 0),
(4, 0, 0, 0, 0, 0, 0)
).toDF("segment_id", "val1", "val2", "val3", "val4", "val5", "val6")
both would give you the desired result:
+----+---+---+---+---+
|vals| 1| 2| 3| 4|
+----+---+---+---+---+
|val1|100| 0| 0| 0|
|val2| 0| 50| 0| 0|
|val3| 0| 0| 0| 0|
|val4| 0| 0| 0| 0|
|val5| 0| 20| 0| 0|
|val6| 0| 0| 0| 0|
+----+---+---+---+---+
That being said, if you need an efficient transposition on a distributed data structure, you'll have to look elsewhere. There are a number of structures, including the core CoordinateMatrix and BlockMatrix, which can distribute data across both dimensions and can be transposed.
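As a rough illustration of that route, here is a minimal sketch using mllib's CoordinateMatrix. It assumes the same df as above, and the mapping of segment_id - 1 and the val{x} position to matrix row/column indices is an arbitrary choice made for this example:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Turn each (segment_id, val{x}) cell into a MatrixEntry(row, col, value),
// where row = segment_id - 1 and col = position of val{x} (both assumed mappings).
val entries = df.rdd.flatMap { row =>
  val segment = row.getInt(0)
  (1 until row.length).map { i =>
    MatrixEntry(segment - 1, i - 1, row.getInt(i).toDouble)
  }
}

val matrix = new CoordinateMatrix(entries)

// transpose() just swaps the row and column indices of each entry;
// the data stays distributed across the cluster.
val transposed: CoordinateMatrix = matrix.transpose()

Note the result is purely numeric, so the val{x} names would have to be tracked separately (e.g. via the column index), and converting back to a DataFrame is an extra step.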