scala: Handle tuple where second element of tuple is an array of strings

I have an RDD with the following type:

org.apache.spark.rdd.RDD[(String, Array[String])] = MappedRDD[40] at map at <console>:14

Here is what x.take(1) looks like:

Array[(String, Array[String])] = Array((8239427349237423,Array(122641|2|2|1|1421990315711|38|6487985623452037|684|, 1229|2|1|1|1411349089424|87|462966136107937|1568|.....))

For each string in the array I want to split on "|", take the item at index 6 (the seventh field), and return it prefixed with the first element of the tuple, as follows:

8239427349237423-6487985623452037
8239427349237423-4629661361079371

I started as follows:

  def getValues(lines: Array[String]) {
    for(line <- lines) {
      line.split("|")(6)
    }
  }

I also tried the following:

val b= x.map(a => (a._1, a._2.flatMap(y => y.split("|")(6))))

But that gave me the following:

Array[(String, Array[Char])] = Array((8239427349237423,Array(1, 2, 4, |, 9, |, 4, 1, 7, 6, |, 2, 9, 2, 7, 2, |, 7, |,....)))
add-semi-colons asked Dec 12 '25 17:12

1 Answer

If you want to do it for the whole of x, you can use flatMap:

def getValues(x: Array[(String, Array[String])]) =
  x flatMap (line => line._2 map (line._1 + "-" + _.split("\\|")(6)))

Or, perhaps more clearly, with a for-comprehension:

def getValues(x: Array[(String, Array[String])]) = 
  for {
    (fst, snd) <- x
    line <- snd
  } yield fst + "-" + line.split("\\|")(6)

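You have to call split with "\\|" because String.split takes a regular expression, and | is the regex alternation operator, so it must be escaped. Unescaped, "|" matches the empty string and the input is split between every character. (Edit: or you can pass '|' as a Char, which is matched literally, as suggested by @BenReich.)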
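To make the difference concrete, here is a minimal sketch that applies the three forms of split to the first sample record from the question. The object name SplitDemo and the value names are made up for illustration; only the record itself comes from the question.

```scala
object SplitDemo {
  // First sample record from the question.
  val record = "122641|2|2|1|1421990315711|38|6487985623452037|684|"

  // Unescaped: "|" is read as a regex that matches the empty string,
  // so the record is cut between characters and index 6 is a
  // one-character string, not a field.
  val unescaped: String = record.split("|")(6)

  // Escaped: the pipe is matched literally, so index 6 is the seventh field.
  val escaped: String = record.split("\\|")(6)

  // Char overload: the separator is treated literally, no escaping needed.
  val byChar: String = record.split('|')(6)
}
```

Both escaped and byChar evaluate to 6487985623452037, which is the value the question expects for this record.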

To answer your comment, you can modify getValues to take a single element from x as an argument:

def getValues(item: (String, Array[String])) =
  item._2 map (item._1 + "-" + _.split('|')(6))

And then call it with

x flatMap getValues
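As a rough end-to-end check, the sketch below runs the per-item getValues against the two sample records from the question. It assumes a plain local Array standing in for the RDD, so that no Spark dependency is needed; the object name GetValuesDemo and the value result are hypothetical, and the lambda passed to flatMap is meant to be equivalent in intent to x flatMap getValues.

```scala
object GetValuesDemo {
  // Sample data copied from the question. A local Array stands in for
  // the RDD here (assumption), so this runs without Spark.
  val x: Array[(String, Array[String])] = Array(
    ("8239427349237423", Array(
      "122641|2|2|1|1421990315711|38|6487985623452037|684|",
      "1229|2|1|1|1411349089424|87|462966136107937|1568|"
    ))
  )

  // Per-item version from the answer: prefix the field at index 6 of
  // every line with the tuple's first element.
  def getValues(item: (String, Array[String])): Array[String] =
    item._2.map(line => item._1 + "-" + line.split('|')(6))

  // flatMap flattens the per-item arrays into one array of strings.
  val result: Array[String] = x.flatMap(item => getValues(item))
}
```

With the two sample records as printed in the question, result holds 8239427349237423-6487985623452037 and 8239427349237423-462966136107937.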
Kolmar answered Dec 15 '25 22:12