val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val tokenizedLines = lines.map(Tokenizer.tokenize)
In the above code snippet, the tokenize function may return an empty string. How do I skip adding it during the map in that case, or remove the empty entries after the map?
tokenizedLines.filter(_.nonEmpty)
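For context, here is a minimal sketch of where that line fits in the question's pipeline (same variable names as the question; nothing else is assumed):

import org.apache.spark.rdd.RDD

val lines: RDD[String] = sc.textFile("/tmp/inputs/*")

// tokenize first, then drop any empty results in a separate filter step
val tokenizedLines: RDD[String] = lines
  .map(Tokenizer.tokenize)
  .filter(_.nonEmpty)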
The currently accepted answer, using filter and nonEmpty, incurs a small performance penalty because nonEmpty is not a method on String; it is added through an implicit conversion. Where the implicit wrapper is a value class, I would expect the difference to be almost imperceptible, but on Scala versions where that is not the case, it can be a substantial hit.
Instead, one can use the following, which calls isEmpty directly on String and avoids the conversion entirely:
tokenizedLines.filterNot(_.isEmpty)
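To make that concrete, here is a small plain-Scala sketch (no Spark involved; augmentString is the Predef conversion as of Scala 2.12/2.13): isEmpty is defined on java.lang.String itself, whereas nonEmpty only exists after the string has been implicitly wrapped in StringOps.

val s = ""

// Calls java.lang.String#isEmpty directly, no wrapper object involved.
val direct = s.isEmpty

// Resolves through the implicit conversion in Predef, roughly:
// Predef.augmentString(s).nonEmpty
val wrapped = s.nonEmpty

println(s"isEmpty = $direct, nonEmpty = $wrapped") // isEmpty = true, nonEmpty = false for ""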
You could use flatMap with Option, wrapping the tokenize call itself, so that empty results are never emitted in the first place. Something like this:
val tokenizedLines = lines.flatMap { line =>
  Tokenizer.tokenize(line) match {
    case "" => None    // skip empty results entirely
    case s  => Some(s) // keep non-empty tokens
  }
}
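The same idea can be folded into a one-liner with Option's filter (a sketch assuming, as in the question, that Tokenizer.tokenize returns a String that may be empty). flatMap drops the None entries, so empty tokens never reach the resulting RDD, and wrapping in Option(...) also guards against a null return:

// Option("")  -> Some("") -> filter(_.nonEmpty) -> None      (dropped by flatMap)
// Option("x") -> Some("x") -> filter(_.nonEmpty) -> Some("x") (kept)
val tokenizedLines = lines.flatMap(line => Option(Tokenizer.tokenize(line)).filter(_.nonEmpty))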