I am trying to solve the age-old problem of adding a sequence number to a data set. I am working with DataFrames, and there appears to be no DataFrame equivalent to RDD.zipWithIndex. On the other hand, the following works more or less the way I want it to:
val origDF = sqlContext.load(...)

val seqDF = sqlContext.createDataFrame(
  origDF.rdd.zipWithIndex.map(ln => Row.fromSeq(Seq(ln._2) ++ ln._1.toSeq)),
  StructType(Array(StructField("seq", LongType, false)) ++ origDF.schema.fields)
)

In my actual application, origDF won't be loaded directly out of a file -- it is going to be created by joining 2-3 other DataFrames together and will contain upwards of 100 million rows.
Is there a better way to do this? What can I do to optimize it?
Apply zipWithIndex to the RDD underlying the DataFrame. We have to convert the existing DataFrame into an RDD first, since zipWithIndex is only defined there. Because zipWithIndex starts its indices at 0 and we want to start from 1, we add 1 to the generated row id (rowId + 1); replace 1 with your own offset value if you need one. We also have to add the newly generated number to each existing row's list of values.
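A minimal sketch of that approach, assuming an input DataFrame df and a sqlContext in scope (the column name "rowId" and the +1 offset are just illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Convert to an RDD, zip each row with its index, and append the index (offset by 1) to the row.
val rddWithId = df.rdd.zipWithIndex.map { case (row, rowId) =>
  Row.fromSeq(row.toSeq :+ (rowId + 1))
}

// Extend the original schema with the new "rowId" column and rebuild the DataFrame.
val dfWithId = sqlContext.createDataFrame(
  rddWithId,
  StructType(df.schema.fields :+ StructField("rowId", LongType, nullable = false))
)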
Spark DataFrame schemas are defined as a collection of typed columns. The entire schema is stored as a StructType and individual columns are stored as StructFields.
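For example, a two-column schema might look like this (the column names are purely illustrative):

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// A schema is a StructType made up of StructFields: one per column,
// each with a name, a data type, and a nullability flag.
val schema = StructType(Array(
  StructField("seq", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))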
What Are DataFrames? In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set. Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in DataFrame, Column, and functions.
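A sketch of that, assuming a Parquet file at a hypothetical path people.parquet:

// Point Spark SQL at a Parquet data set; the path is hypothetical.
val people = sqlContext.read.parquet("people.parquet")

// Once created, the DataFrame can be manipulated with the DSL:
// Column expressions and the helpers in org.apache.spark.sql.functions.
import org.apache.spark.sql.functions.col
val adults = people.filter(col("age") >= 21).select(col("name"), col("age"))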
The following was posted on behalf of David Griffin (edited out of the question).
The all-singing, all-dancing dfZipWithIndex method. You can set the starting offset (which defaults to 1), the index column name (defaults to "id"), and place the column in the front or the back:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{LongType, StructField, StructType}
import org.apache.spark.sql.Row

def dfZipWithIndex(
  df: DataFrame,
  offset: Int = 1,
  colName: String = "id",
  inFront: Boolean = true
) : DataFrame = {
  df.sqlContext.createDataFrame(
    // Zip each row with its index and splice the index into the row,
    // either at the front or at the back.
    df.rdd.zipWithIndex.map(ln =>
      Row.fromSeq(
        (if (inFront) Seq(ln._2 + offset) else Seq()) ++
          ln._1.toSeq ++
          (if (inFront) Seq() else Seq(ln._2 + offset))
      )
    ),
    // Extend the schema with the index column in the matching position.
    StructType(
      (if (inFront) Array(StructField(colName, LongType, false)) else Array[StructField]()) ++
        df.schema.fields ++
        (if (inFront) Array[StructField]() else Array(StructField(colName, LongType, false)))
    )
  )
}
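A usage sketch (the column name and offset values here are just examples):

// Index column "seq" placed at the front, starting from 0.
val indexedDF = dfZipWithIndex(origDF, offset = 0, colName = "seq", inFront = true)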
Since Spark 1.6 there is a function called monotonically_increasing_id()
It generates a new column with a unique, monotonically increasing 64-bit value for each row.
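A minimal usage sketch (the column name "id" is illustrative):

import org.apache.spark.sql.functions.monotonically_increasing_id

// Each row gets a unique 64-bit id; values within a partition are consecutive,
// but each partition starts from its own base, so ids are not globally consecutive.
val dfWithId = df.withColumn("id", monotonically_increasing_id())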
But it isn't consecutive: each partition starts a new range, so we must compute each partition's offset before using it.
Trying to provide an "rdd-free" solution, I ended up with some collect(), but it only collects the offsets, one value per partition, so it will not cause an OOM.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.LongType

def zipWithIndex(df: DataFrame, offset: Long = 1, indexName: String = "index") = {
  // Tag every row with its partition id and a monotonically increasing id.
  val dfWithPartitionId = df
    .withColumn("partition_id", spark_partition_id())
    .withColumn("inc_id", monotonically_increasing_id())

  // One value per partition: the amount to add to inc_id so the indices become consecutive.
  val partitionOffsets = dfWithPartitionId
    .groupBy("partition_id")
    .agg(count(lit(1)) as "cnt", first("inc_id") as "inc_id")
    .orderBy("partition_id")
    .select(sum("cnt").over(Window.orderBy("partition_id")) - col("cnt") - col("inc_id") + lit(offset) as "cnt")
    .collect()
    .map(_.getLong(0))
    .toArray

  // Add each row's partition offset to its inc_id and drop the helper columns.
  dfWithPartitionId
    .withColumn("partition_offset", udf((partitionId: Int) => partitionOffsets(partitionId), LongType)(col("partition_id")))
    .withColumn(indexName, col("partition_offset") + col("inc_id"))
    .drop("partition_id", "partition_offset", "inc_id")
}

This solution doesn't repack the original rows and doesn't repartition the original huge DataFrame, so it is quite fast in the real world: 200 GB of CSV data (43 million rows with 150 columns) was read, indexed and packed to Parquet in 2 minutes on 240 cores.
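A usage sketch, assuming an existing DataFrame df (the index column name is just an example):

// Adds a consecutive "row_num" column starting at 1.
val indexed = zipWithIndex(df, offset = 1, indexName = "row_num")
indexed.select("row_num").show(5)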
After testing my solution, I ran Kirk Broadhurst's solution, and it was 20 seconds slower.
You may or may not want to use dfWithPartitionId.cache(), depending on the task.
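dfWithPartitionId is read twice inside zipWithIndex (once to collect the partition offsets and once to produce the indexed output), so caching it can avoid recomputing the upstream plan; a sketch of where the call could go:

// Cache the intermediate DataFrame because it is used twice.
val dfWithPartitionId = df
  .withColumn("partition_id", spark_partition_id())
  .withColumn("inc_id", monotonically_increasing_id())
  .cache()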