How to convert RDD[(String, String)] into RDD[Array[String]]?

I am trying to append the filename to each record in the file. I thought this would be easy to do if the RDD held arrays instead of tuples.

Some help with converting RDD type or solving this problem would be much appreciated!

With RDD[(String, String)]:

scala> myRDD.first()(1)
<console>:24: error: (String, String) does not take parameters
       myRDD.first()(1)

With RDD[Array[String]]:

scala> myRDD.first()(1)
res1: String = abcdefgh

My function:

import scala.util.matching.Regex

def appendKeyToValue(x: Array[Array[String]]): Unit = {
    for (i <- 0 to (x.length - 1)) {
        val key = x(i)(0)
        // Replace every dot in the file name with "|"
        val pattern = new Regex("\\.")
        val key2 = pattern.replaceAllIn(key, "|")
        val tempvalue = x(i)(1)
        val finalval = tempvalue.split("\n")
        for (ab <- 0 to (finalval.length - 1)) {
            // Prefix each record with the file name (the result is not yet returned or stored)
            val result = key2 + "|" + finalval(ab)
        }
    }
}
WoodChopper asked Sep 15 '15 10:09


1 Answer

If you have an RDD[(String, String)], you can access the first field of the first tuple by calling:

val firstTupleField: String = myRDD.first()._1
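
The error in the question comes from indexing a tuple as if it were an array. A minimal sketch of the difference in plain Scala, with no Spark required; the sample pair is a hypothetical stand-in for what myRDD.first() might return:

```scala
object TupleAccess {
  // Tuples use 1-based field accessors (_1, _2).
  def secondOfTuple(pair: (String, String)): String = pair._2

  // Arrays use 0-based indexing with parentheses.
  def secondOfArray(arr: Array[String]): String = arr(1)

  def main(args: Array[String]): Unit = {
    // Hypothetical sample record: (file name, file content)
    val pair: (String, String) = ("file1.txt", "abcdefgh")
    val asArray: Array[String] = Array(pair._1, pair._2)

    println(secondOfTuple(pair))    // prints "abcdefgh"
    println(secondOfArray(asArray)) // prints "abcdefgh"
  }
}
```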

If you want to convert an RDD[(String, String)] into an RDD[Array[String]], you can do the following:

val arrayRDD: RDD[Array[String]] = myRDD.map(x => Array(x._1, x._2))

You may also use a pattern-matching anonymous function to destructure the tuples:

val arrayRDD: RDD[Array[String]] = myRDD.map { case (a,b) => Array(a, b) }
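
For the underlying goal in the question (prefixing each record with its file name), the conversion to arrays may not be needed at all. The following is only a sketch under assumptions: it treats each pair as (file name, whole file content), as an RDD built with SparkContext.wholeTextFiles would contain, and the object name, parameter names, and sample data are hypothetical. The core logic is a plain function, so it can be checked without a Spark cluster:

```scala
object AppendKey {
  // Replace dots in the file name with "|" and prefix the result to every line of the content.
  def appendKeyToValue(fileName: String, content: String): Seq[String] = {
    val key = fileName.replace(".", "|") // literal replacement, same effect as the regex "\\."
    content.split("\n").toSeq.map(line => key + "|" + line)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample standing in for one (fileName, content) record
    appendKeyToValue("data.txt", "row1\nrow2").foreach(println)
    // prints:
    // data|txt|row1
    // data|txt|row2
  }
}

// With Spark, assuming myRDD: RDD[(String, String)]:
// val records = myRDD.flatMap { case (name, content) => AppendKey.appendKeyToValue(name, content) }
```

Using flatMap here yields one output record per input line, so the result would be an RDD[String] rather than a value computed and discarded inside a loop.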
Till Rohrmann answered Sep 18 '22 15:09