
Removing punctuation marks from text in Scala - Spark

This is one sample of my data:

case time (especially it's purse), read manual care, follow care instructions make stays waterproof -- example, inspect rubber seals doors (especially battery/memory card door open time) 
xm "life support" picture . flip part bit flimsy guessing won't long . sound great altec speaker dock it! chance back base (xm3020) . traveling bag connect laptop extra speaker . amount paid ($25).

I want to remove all punctuation marks except the dot (.) and also remove words of length <= 2. For example, my expected output is:

case time especially its purse read manual care follow care instructions . make stays waterproof example inspect rubber seals doors especially batterymemory card door open time
life support picture . flip part bit flimsy guessing wont long . sound great altec speaker dock chance back base xm3020 . traveling bag connect laptop extra speaker . amount paid $25 .

This should be implemented in Scala. I've tried:

replaceAll( """\\W\s""", "")
replaceAll(""""[^a-zA-Z\.]""", "")

but this doesn't work well. Can anybody help me?

Rozita asked May 06 '15 10:05



2 Answers

You can try filtering the string like this:

val example = "Hey there! It's me, myself and I."
example.filterNot(x => x == ',' || x == '!' || x == 'm')
 res3: String = Hey there It's e yself and I.
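Listing every unwanted character by hand gets long quickly. As a rough sketch of the same filtering idea turned around, one could keep only letters, digits, whitespace and the dot. The object and method names here are made up for illustration, and note that this variant would also drop '$':

```scala
object KeepDots {
  // Keep letters, digits, whitespace and '.'; drop every other character.
  // (Hypothetical helper, not part of any library.)
  def stripPunctuation(s: String): String =
    s.filter(c => c.isLetterOrDigit || c.isWhitespace || c == '.')
}
```

For the example string above, KeepDots.stripPunctuation(example) should give "Hey there Its me myself and I."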
Duzzz answered Oct 19 '22 10:10


Looking at the regex javadoc (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), we see that the character class for punctuation is \p{Punct} and that we can exclude characters from a character class with an intersection such as [a-z&&[^def]]. From there it is easy to define a regex that will remove all punctuation except the dot:

s.replaceAll("""[\p{Punct}&&[^.]]""", "")
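To make the effect of that pattern concrete, here is a small sketch applied to a fragment of the sample data. The wrapper object and method name are only for illustration:

```scala
object PunctExceptDot {
  // \p{Punct} intersected with "anything but a dot":
  // every ASCII punctuation character except '.' is removed.
  def strip(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.]]""", "")
}
```

Calling PunctExceptDot.strip("dock it! chance back base (xm3020) . amount paid ($25).") should return "dock it chance back base xm3020 . amount paid 25.", so the dots survive while '!', the parentheses and '$' are dropped.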

Removing words with size <= 2 could be done like so:

s.replaceAll("""\b\p{IsLetter}{1,2}\b""", "")

Combining the two, this gives:

s.replaceAll("""([\p{Punct}&&[^.]]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Note how I added \s* to remove redundant spaces.
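One caveat with the single-pass expression: it is matched against the original text, so in "won't long" the apostrophe is removed, the lone "t" then counts as a short word, and the trailing \s* swallows the following space, which glues the neighbours together. A possible alternative, sketched here under the assumption that '$' should be kept as in the expected output, is to run the steps one after another and normalise whitespace at the end (TwoPassClean and clean are illustrative names):

```scala
object TwoPassClean {
  def clean(s: String): String =
    s.replaceAll("""[\p{Punct}&&[^.$]]""", "")     // 1. drop punctuation, keeping '.' and '$'
     .replaceAll("""\b\p{IsLetter}{1,2}\b""", "")  // 2. drop words of one or two letters
     .replaceAll("""\s+""", " ")                   // 3. collapse runs of whitespace
     .trim
}
```

With this ordering, TwoPassClean.clean("guessing won't long . dock it! chance (xm3020) . paid ($25).") should yield "guessing wont long . dock chance xm3020 . paid $25.". In Spark, the same function could presumably be applied to each record, for example through map on an RDD of lines.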

Also, you can see that the above regex entirely removes '$', because it is a punctuation character (as defined by the \p{Punct} class in java.util.regex, which covers all ASCII punctuation). If that is undesirable (as your expected output seems to indicate), please be more precise about what you consider punctuation. For example, you might want to consider only the following characters as punctuation: ?.!:()

s.replaceAll("""([?!:()]|\b\p{IsLetter}{1,2}\b)\s*""", "")

Alternatively, you could just add '$' to your "not-punctuation" character-list, along with the dot:

s.replaceAll("""([\p{Punct}&&[^.$]]|\b\p{IsLetter}{1,2}\b)\s*""", "")
Régis Jean-Gilles answered Oct 19 '22 11:10