Apache Spark RDD Split "|"

I am trying to produce a formatted CSV file from a pipe-delimited ("|") file using Apache Spark. The input file contains:

apple|ball|cat

Blacktown| Bela vista| Greenacre

x|y|z

This is what I have tried:

val name = sc.textFile("input.txt")
val split = name.map(line => line.split("|")).map(x => (x(0), x(2)))
split.foreach(println)

Output:

(x,y)

(a,p)

(B,a)

My required output is:

(apple,cat)

(Blacktown, Greenacre)

(x,z)

Rana asked Oct 09 '16 20:10


1 Answer

The String argument to split is a regular expression, so a pipe has to be escaped to match it literally:

line.split("\\|")

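Otherwise "|" is interpreted as an alternation between two empty patterns. That matches the empty string at every position, so the line is split into single characters, which is why x(0) and x(2) come back as "a" and "p" for the first line.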
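
To see the difference without starting Spark, here is a minimal sketch in plain Scala. The value names (sampleLine, unescaped, escaped, quoted) are only for illustration, and the single-character result assumes Java 8 or later, where a zero-width match at the start of the string does not produce a leading empty element. Pattern.quote is a standard alternative to escaping by hand.

```scala
import java.util.regex.Pattern

// Hypothetical sample line taken from the question's input file.
val sampleLine = "apple|ball|cat"

// Unescaped: "|" is an alternation of two empty patterns,
// so it matches between every pair of characters.
val unescaped = sampleLine.split("|")

// Escaped: "\\|" matches the literal pipe character.
val escaped = sampleLine.split("\\|")

// Pattern.quote builds a regex that matches its argument literally.
val quoted = sampleLine.split(Pattern.quote("|"))

println(unescaped.take(3).mkString(","))  // a,p,p
println(escaped.mkString(","))            // apple,ball,cat
println(quoted.mkString(","))             // apple,ball,cat
```

The first line of output matches the (a,p) pair in the question, since x(0) is "a" and x(2) is "p".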

You can also use the variant that accepts a Char literal:

line.split('|')

or an Array of Char literals:

line.split(Array('|'))
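
As a quick check that the three forms agree, a small sketch in plain Scala (the value names are hypothetical, and the input is one of the lines from the question):

```scala
// Hypothetical sample line; note the leading spaces after each pipe.
val rawLine = "Blacktown| Bela vista| Greenacre"

val byRegex = rawLine.split("\\|")       // String argument, treated as a regex
val byChar  = rawLine.split('|')         // Char argument, matched literally
val byChars = rawLine.split(Array('|'))  // Array[Char], any listed char is a separator

// All three produce the same three fields.
println(byRegex.sameElements(byChar) && byChar.sameElements(byChars))

// The fields keep their surrounding whitespace; trim them if that is not wanted.
val trimmed = byChar.map(_.trim)
println(trimmed.mkString(","))  // Blacktown,Bela vista,Greenacre
```

The Char variants come from Scala's StringOps, so no regex escaping is involved.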

It is also safer to validate the input, keeping only the lines that split into exactly three fields:

name.map(_.split("\\|")).collect {
  case Array(x, _, y) => (x, y)
}
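
The pattern match can be tried on an ordinary List, whose collect also takes a partial function and drops elements that do not match. This is only a sketch under that assumption: the two extra malformed lines ("only|two" and the empty string) are invented for illustration and are not in the question's input.

```scala
// Lines from the question plus two hypothetical malformed ones.
val sampleLines = List(
  "apple|ball|cat",
  "Blacktown| Bela vista| Greenacre",
  "x|y|z",
  "only|two",  // two fields: does not match Array(x, _, y)
  ""           // splits to Array(""): does not match either
)

// Keep the first and third field of every line with exactly three fields.
val pairs = sampleLines.map(_.split("\\|")).collect {
  case Array(first, _, third) => (first, third)
}

pairs.foreach(println)
// (apple,cat)
// (Blacktown, Greenacre)
// (x,z)
```

One thing to watch: split with a String argument drops trailing empty fields, so "a|b|" yields two fields and would be filtered out here. Passing a negative limit, as in split("\\|", -1), keeps them.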
zero323 answered Sep 24 '22 00:09