
How to add a new column to a Spark RDD?

I have an RDD with many columns (e.g., hundreds). How do I add one more column at the end of this RDD?

For example, if my RDD is like below:

    123, 523, 534, ..., 893
    536, 98, 1623, ..., 98472
    537, 89, 83640, ..., 9265
    7297, 98364, 9, ..., 735
    ......
    29, 94, 956, ..., 758

How can I add a column to it whose value is the sum of the second and third columns?

Thank you very much.

Carter asked Apr 30 '15 08:04




2 Answers

You do not have to use Tuple* objects at all for adding a new column to an RDD.

It can be done by mapping each row, taking its original contents plus the elements you want to append, for example:

    import org.apache.spark.sql.Row

    val rdd = ...  // an RDD[Row]
    val withAppendedColumnsRdd = rdd.map { row =>
      // Copy the existing columns, then append the new value at the end.
      val originalColumns = row.toSeq.toList
      val secondColValue = originalColumns(1).asInstanceOf[Int]
      val thirdColValue = originalColumns(2).asInstanceOf[Int]
      val newColumnValue = secondColValue + thirdColValue
      Row.fromSeq(originalColumns :+ newColumnValue)
      // Row.fromSeq(originalColumns ++ List(newColumnValue1, newColumnValue2, ...)) // or add several new columns
    }
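For anyone who wants to try this end to end, here is a minimal sketch of the same idea run in local mode. It assumes the RDD holds `Row` objects with `Int` columns; the object name `AppendColumnToRowRdd`, the helper `withSum`, and the two four-column sample rows are made up for illustration, and `getInt` is used in place of the `asInstanceOf[Int]` casts.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row

// Hypothetical names throughout; only the Spark API calls are real.
object AppendColumnToRowRdd {
  // Per-row transformation: keep all columns and append (2nd column + 3rd column).
  def withSum(row: Row): Row =
    Row.fromSeq(row.toSeq :+ (row.getInt(1) + row.getInt(2)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("append-column").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows; a real RDD would have many more columns.
      val rdd = sc.parallelize(Seq(
        Row(123, 523, 534, 893),
        Row(536, 98, 1623, 98472)
      ))
      // Each output row has one extra column at the end.
      rdd.map(withSum).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Keeping the per-row logic in a small function like `withSum` makes it easy to check on a single `Row` without starting a Spark context.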
Antot answered Oct 19 '22 15:10


If you have an RDD of Tuple4, apply map to convert each element to a Tuple5:

    val rddTuple4RDD = ...........
    // Use the lambda parameter r, not the RDD itself, to read the fields.
    val rddTuple5RDD = rddTuple4RDD.map(r => Tuple5(r._1, r._2, r._3, r._4, r._2 + r._3))
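Note that Scala 2 tuples are defined only up to Tuple22, so the tuple approach cannot hold a row with hundreds of columns. A rough sketch of an alternative is below, assuming each row is an `Array[Int]` (the question does not say what the element type is). The names `AppendColumnToArrayRdd` and `withSum` and the sample rows are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical names throughout; only the Spark API calls are real.
object AppendColumnToArrayRdd {
  // Returns a new array with (2nd column + 3rd column) appended at the end.
  def withSum(row: Array[Int]): Array[Int] =
    row :+ (row(1) + row(2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("append-column-array").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows; the same code works for any row width of 3 or more.
      val rdd = sc.parallelize(Seq(
        Array(123, 523, 534, 893),
        Array(536, 98, 1623, 98472)
      ))
      rdd.map(withSum).collect().foreach(r => println(r.mkString(", ")))
    } finally {
      sc.stop()
    }
  }
}
```

Because the row is a plain array, the width is not fixed at compile time, which is what lets this scale past the 22-field tuple limit.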
banjara answered Oct 19 '22 17:10