How to transpose an RDD in Spark

Tags: scala, apache-spark, rdd

I have an RDD like this:

1 2 3
4 5 6
7 8 9

It represents a matrix. I want to transpose the RDD so that it looks like this:

1 4 7
2 5 8
3 6 9

How can I do this?

asked Apr 01 '15 12:04

赵祥宇

1 Answer

Say you have an N×M matrix.

If both N and M are so small that you can hold N×M items in memory, it doesn't make much sense to use an RDD. But transposing it is easy:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
val transposed = sc.parallelize(rdd.collect.toSeq.transpose)
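The driver-side step can be looked at in isolation. The following is a minimal sketch without Spark, assuming the rows have already been collected into a local `Seq`; the object name `LocalTranspose` and the 2×3 sample matrix are made up for illustration. It shows that `transpose` turns an N×M collection into an M×N one.

```scala
object LocalTranspose {
  // Hypothetical sample data: a 2×3 matrix held entirely in local memory,
  // as it would be after rdd.collect.toSeq.
  val rows: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

  // transpose is an ordinary Scala collections method; no Spark involved.
  val flipped: Seq[Seq[Int]] = rows.transpose // a 3×2 matrix

  def main(args: Array[String]): Unit = {
    // Prints:
    // 1 4
    // 2 5
    // 3 6
    flipped.foreach(row => println(row.mkString(" ")))
  }
}
```

Everything here happens on the driver, which is why this approach only suits matrices that fit in a single machine's memory.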

If N or M is so large that you cannot hold N or M entries in memory, then a single row of that length cannot be stored as one RDD element. Either the original or the transposed matrix is impossible to represent in this row-per-element form.

N and M may be of a medium size: you can hold N or M entries in memory, but you cannot hold N×M entries. In this case you have to break the matrix apart into individual cells, each tagged with its row and column index, and then reassemble it column by column:

val rdd = sc.parallelize(Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9)))
// Split the matrix into one record per number, keyed by column index.
val byColumnAndRow = rdd.zipWithIndex.flatMap {
  case (row, rowIndex) => row.zipWithIndex.map {
    case (number, columnIndex) => columnIndex -> (rowIndex, number)
  }
}
// Build up the transposed matrix. Group and sort by column index first.
val byColumn = byColumnAndRow.groupByKey.sortByKey().values
// Then sort by row index.
val transposed = byColumn.map {
  indexedRow => indexedRow.toSeq.sortBy(_._1).map(_._2)
}
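To check the logic without a cluster, the same steps can be traced on plain Scala collections. This is only a sketch: `TransposeTrace` and `transposeLocally` are hypothetical names, and the local `groupBy` and `sortBy` calls merely stand in for `groupByKey` and `sortByKey` on the RDD.

```scala
object TransposeTrace {
  // Hypothetical helper that mirrors the RDD pipeline on local collections.
  def transposeLocally[T](matrix: Seq[Seq[T]]): Seq[Seq[T]] = {
    // One (columnIndex, (rowIndex, value)) record per cell.
    val cells = matrix.zipWithIndex.flatMap { case (row, rowIndex) =>
      row.zipWithIndex.map { case (value, columnIndex) =>
        columnIndex -> (rowIndex, value)
      }
    }
    cells
      .groupBy(_._1) // stands in for groupByKey
      .toSeq
      .sortBy(_._1)  // stands in for sortByKey: columns in ascending order
      .map { case (_, column) =>
        // Restore the original row order inside each column, then drop the indices.
        column.map(_._2).sortBy(_._1).map(_._2)
      }
  }

  def main(args: Array[String]): Unit = {
    val matrix = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
    // Prints:
    // 1 4 7
    // 2 5 8
    // 3 6 9
    transposeLocally(matrix).foreach(row => println(row.mkString(" ")))
  }
}
```

The per-column sort by row index matters in the Spark version because `groupByKey` makes no promise about the order of the values it groups.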

answered Sep 22 '22 01:09

Daniel Darabos
