
Spark Scala How to use replace function in RDD

I have a file of tweets:

396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
396124437168537600,"I really wish I didn't give up everything I did for you.     I'm so mad at my self for even letting it get as far as it did.",savava143
396124436958412800,"I really need to double check who I'm sending my     snapchats to before sending it 😩😭",juliannpham
396124437218885632,"@Darrin_myers30 I feel you man, gotta stay prayed up.     Year is important",Ful_of_Ambition
396124437558611968,"tell me what I did in my life to deserve this.",_ItsNotBragging
396124437499502592,"Too many fine men out here...see me drooling",LolaofLife
396124437722198016,"@jaiclynclausen will do",I_harley99

I am trying to replace all special characters after reading the file into an RDD:

    val fileReadRdd = sc.textFile(fileInput)
    val fileReadRdd2 = fileReadRdd.map(x => x.map(_.replace(","," ")))
    val fileFlat = fileReadRdd.flatMap(rec => rec.split(" "))

I am getting the following error:

Error:(41, 57) value replace is not a member of Char
    val fileReadRdd2 = fileReadRdd.map(x => x.map(_.replace(",","")))

Ravinder Karra asked Mar 20 '17 16:03


1 Answer

I suspect that

    x => x.map(_.replace(",",""))

is treating your string as a sequence of characters, so each _ is a Char, and Char has no replace method taking two strings. You actually want

    x => x.replace(",", "")

(i.e. you don't need to map over the 'sequence' of chars).
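
To make the difference concrete, here is a minimal sketch in plain Scala. It uses an ordinary Seq[String] as a stand-in for the RDD, on the assumption that RDD.map and RDD.flatMap take functions of the same shape; the object name ReplaceDemo and the value names are made up for illustration. It also splits the cleaned collection rather than the original one, since the question's third line calls flatMap on fileReadRdd, not fileReadRdd2, which may not be what was intended.

```scala
object ReplaceDemo {
  // Stand-in for sc.textFile(fileInput): one sample line from the question's file.
  val lines: Seq[String] = Seq(
    "396124437722198016,\"@jaiclynclausen will do\",I_harley99"
  )

  // Call replace on the whole String; no inner map is needed.
  val noCommas: Seq[String] = lines.map(line => line.replace(",", " "))

  // Mapping over the String is also possible, but then each element is a Char,
  // so the function has to work character by character.
  val viaCharMap: Seq[String] =
    lines.map(line => line.map(c => if (c == ',') ' ' else c))

  // Assumption: "special character" means anything that is not a letter,
  // digit, or space. replaceAll takes a regular expression.
  val onlyAlnum: Seq[String] =
    lines.map(line => line.replaceAll("[^A-Za-z0-9 ]", " "))

  // Split the cleaned lines (not the original ones) into words.
  val words: Seq[String] = noCommas.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    noCommas.foreach(println)
    words.foreach(println)
  }
}
```

With a real SparkContext the same lambdas should carry over unchanged, for example fileReadRdd.map(x => x.replace(",", " ")).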

Brian Agnew answered Sep 25 '22 10:09