Is it possible, and what would be the most efficient and neat method, to add a column to a DataFrame?
More specifically, the column may serve as Row IDs for the existing DataFrame.
In a simplified case, reading from a file and not tokenizing it, I can think of something like the following (in Scala), but it completes with errors (at line 3), and anyway doesn't look like the best route possible:
var dataDF = sc.textFile("path/file").toDF()
val rowDF = sc.parallelize(1 to dataDF.count().toInt).toDF("ID")
dataDF = dataDF.withColumn("ID", rowDF("ID"))
It's been a while since I posted the question, and it seems that some other people would like to get an answer as well. Below is what I found.
So the original task was to append a column with row identifiers (basically, a sequence 1 to numRows) to any given data frame, so the row order/presence can be tracked (e.g. when you sample). This can be achieved by something along these lines:
import org.apache.spark.sql.Row

sc.textFile(file).
  zipWithIndex().
  map { case (d, i) => i.toString + delimiter + d }.
  map(_.split(delimiter)).
  map(s => Row.fromSeq(s.toSeq))
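To actually get a data frame back from that RDD[Row], you still need a matching schema and a createDataFrame call. A minimal sketch of that last step, assuming the RDD built above has been assigned to a val indexedRdd and the file splits into exactly two fields (the field names below are made up):

import org.apache.spark.sql.types.{StructField, StructType, StringType}

// First field holds the generated index; "colA"/"colB" stand in for the real split fields
val schema = StructType(Seq(
  StructField("ID", StringType, nullable = false),
  StructField("colA", StringType, nullable = true),
  StructField("colB", StringType, nullable = true)))

val indexedDF = sqlContext.createDataFrame(indexedRdd, schema)
indexedDF.show()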
Regarding the general case of appending any column to any data frame:
The "closest" to this functionality in Spark API are withColumn
and withColumnRenamed
. According to Scala docs, the former Returns a new DataFrame by adding a column. In my opinion, this is a bit confusing and incomplete definition. Both of these functions can operate on this
data frame only, i.e. given two data frames df1
and df2
with column col
:
val df = df1.withColumn("newCol", df1("col") + 1) // -- OK
val df = df1.withColumn("newCol", df2("col") + 1) // -- FAIL
So unless you can manage to transform a column of an existing data frame into the shape you need, you cannot use withColumn or withColumnRenamed to append arbitrary columns (standalone ones or columns from other data frames).
As was commented above, the workaround may be to use a join. This would be pretty messy, although possible: attaching unique keys, like above with zipWithIndex, to both data frames or columns might work (a rough sketch follows below). Although efficiency is ...
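For completeness, here is a rough sketch of that join-based workaround in the spirit of the Spark 1.4+ API. The names df1, df2 and _row_idx are hypothetical, both data frames are assumed to have the same number of rows, and their column names are assumed not to overlap:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a synthetic row index to a data frame via zipWithIndex
def withRowIndex(df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(idx +: row.toSeq) }
  val schema = StructType(StructField("_row_idx", LongType, nullable = false) +: df.schema.fields)
  sqlContext.createDataFrame(indexed, schema)
}

// Index both sides, join on the index, then drop it
val combined = withRowIndex(df1)
  .join(withRowIndex(df2), "_row_idx")
  .drop("_row_idx")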
It's clear that appending a column to a data frame is not an easy operation in a distributed environment, and there may not be a very efficient, neat method for it at all. But I think it's still very important to have this core functionality available, even with performance warnings.
Not sure if it works in Spark 1.3, but in Spark 1.5 I use withColumn:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.withColumn("newName",lit("newValue"))
I use this when I need to use a value that is not related to existing columns of the dataframe
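As a small usage sketch (the column names and values here are made up), the same pattern also covers a typed NULL column by casting the literal:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val tagged = df.withColumn("source", lit("fileA"))                  // constant value
  .withColumn("comment", lit(null).cast(StringType))                // NULL column with an explicit type
tagged.show()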
This is similar to @NehaM's answer but simpler.
I took help from the answer above. However, I find it incomplete if we want to change a DataFrame, and the current APIs are a little different in Spark 1.6.
zipWithIndex() returns a tuple of (Row, Long) which contains each row and its corresponding index. We can use it to build a new Row according to our needs.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

val rdd = df.rdd.zipWithIndex()
  .map { case (row, index) => Row.fromSeq(index.toString +: row.toSeq) }
val newStructure = StructType(Seq(StructField("Row number", StringType, true)) ++ df.schema.fields)
sqlContext.createDataFrame(rdd, newStructure).show()
I hope this will be helpful.