
Removing empty strings from maps in Scala

val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)

In the code snippet above, the tokenize function may return empty strings. How do I skip adding them to the result of the map in that case, or remove the empty entries after the map?

Siva asked Nov 05 '14 09:11


3 Answers

tokenizedLines.filter(_.nonEmpty)

axmrnv answered Nov 19 '22 15:11


The currently accepted answer, which uses filter and nonEmpty, incurs some performance penalty because nonEmpty is not a method on String; it is added through an implicit conversion. With value classes in use (Scala 2.10 and later), I expect the difference to be almost imperceptible, but on versions of Scala where that is not the case, it is a substantial hit.

Instead, one could use this, which is assured to be faster:

// RDD provides filter but no filterNot, so the predicate is negated instead
tokenizedLines.filter(!_.isEmpty)
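
To try the comparison without a Spark cluster, here is a minimal sketch on a plain Scala List standing in for the RDD. The object name and the sample data are made up for illustration. On Scala collections filterNot is available; for an API that only offers filter, the negated predicate should be equivalent.

```scala
object EmptyStringFilterDemo {
  // Hypothetical sample data standing in for the tokenized RDD contents.
  val tokens: List[String] = List("spark", "", "scala", "", "rdd")

  // nonEmpty is not defined on java.lang.String; it is reached through
  // an implicit conversion to a Scala wrapper type.
  val viaNonEmpty: List[String] = tokens.filter(_.nonEmpty)

  // isEmpty is defined directly on java.lang.String, so no conversion is needed.
  val viaIsEmpty: List[String] = tokens.filterNot(_.isEmpty)

  // The same predicate for APIs that only offer filter.
  val viaNegatedFilter: List[String] = tokens.filter(!_.isEmpty)

  def main(args: Array[String]): Unit = {
    println(viaNonEmpty)
    println(viaIsEmpty)
    println(viaNegatedFilter)
  }
}
```

All three forms keep the same elements; they differ only in how the emptiness check is resolved.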
Daniel C. Sobral answered Nov 19 '22 13:11


You could use flatMap with Option.

Something like this:

// applied to the tokenized values, since the empty strings come from tokenize
tokenizedLines.flatMap {
  case "" => None
  case s  => Some(s)
}
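
The same idea can be sketched on a plain Scala List, which makes it easy to check locally. The object name and sample data are hypothetical, and the List only stands in for the RDD.

```scala
object FlatMapOptionDemo {
  // Hypothetical sample data standing in for the RDD contents.
  val tokens: List[String] = List("spark", "", "scala")

  // Each element becomes None (dropped) or Some(value) (kept);
  // flatMap then flattens the Options into the resulting list.
  val kept: List[String] = tokens.flatMap {
    case "" => None
    case s  => Some(s)
  }

  def main(args: Array[String]): Unit = {
    println(kept)
  }
}
```

Compared with filter, this form is mostly useful when the function also transforms the kept values, since the mapping and the dropping happen in one pass.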
crak answered Nov 19 '22 14:11