
Java Split on Spaces and Special Characters

Tags: java, regex, split

I am trying to split a string on spaces and some specific special characters.

Given the string "john - & + $ ? . @ boy" I want to get the array:

array[0]="john";
array[1]="boy";

I've tried several regular expressions and gotten nowhere. Here is my current stab:

String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.@&].*");

Which preserves "john" but not "boy". Can anyone get me the rest of this?

Jeremiah Adams asked Jan 10 '14 21:01





4 Answers

Just use:

String[] terms = input.split("[\\s@&.?$+-]+");

You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, and \ (and ^ when it is the first character). However, & is special only when it comes as the pair &&, and - is treated as a literal character if it is placed at the beginning or the end of the character class.
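As a quick check of the rules above, here is a minimal runnable sketch. The class name CharClassSplitDemo and the helper splitTerms are made up for illustration; only String.split and the pattern come from the answer.

```java
import java.util.Arrays;

public class CharClassSplitDemo {
    // Hypothetical helper: splits on runs of whitespace or any of @ & . ? $ + -
    public static String[] splitTerms(String input) {
        // '-' is the last character in the class, so it is a literal, not a range operator.
        // '.', '?', '$' and '+' need no escaping inside the brackets.
        return input.split("[\\s@&.?$+-]+");
    }

    public static void main(String[] args) {
        String[] terms = splitTerms("john - & + $ ? . @ boy");
        System.out.println(Arrays.toString(terms)); // prints [john, boy]
    }
}
```

One thing to watch for: if the input starts with a separator, split returns an empty string as the first element, so you may want to trim or filter the result in that case.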

Other languages may have different rules for parsing the pattern, but the rule about - applies to most regex engines.

As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. By default, \w in Java is equivalent to [a-zA-Z0-9_] (upper- and lower-case English letters, digits, and underscore), and therefore \W matches every other character. If you want to treat Unicode letters and digits as word characters too, you may want to look at Unicode character classes.
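To illustrate the difference, here is a small sketch that compares the default \W with a Unicode-aware class built from \p{L} (letters) and \p{N} (numbers). The class and method names are hypothetical, and the sample input is an assumption chosen to contain non-ASCII letters.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class UnicodeSplitDemo {
    // Any run of characters that are neither Unicode letters, Unicode numbers, nor '_'.
    private static final Pattern NON_WORD = Pattern.compile("[^\\p{L}\\p{N}_]+");

    // Splits while keeping accented and other non-ASCII letters inside words.
    public static String[] splitUnicode(String input) {
        return NON_WORD.split(input);
    }

    // Splits with the default ASCII-only definition of a word character.
    public static String[] splitAscii(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only: \u00e9 is e-acute, \u00fc is u-umlaut.
        String input = "Jos\u00e9 - & m\u00fcller";
        System.out.println(Arrays.toString(splitUnicode(input))); // two terms
        System.out.println(Arrays.toString(splitAscii(input)));   // three terms: the accented letters act as separators
    }
}
```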

nhahtdh answered Oct 15 '22 21:10



You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting characters instead of blacklisting them, which is usually a good idea.
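A minimal sketch of that approach follows; the class name NonWordSplitDemo and the helper splitWords are made up for illustration. Note that the underscore counts as a word character, so it does not act as a separator.

```java
import java.util.Arrays;

public class NonWordSplitDemo {
    // Hypothetical helper: splits on runs of anything that is not [a-zA-Z0-9_].
    public static String[] splitWords(String input) {
        return input.split("\\W+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("john - & + $ ? . @ boy"))); // prints [john, boy]
        // '_' is a word character, so "john_doe" stays in one piece.
        System.out.println(Arrays.toString(splitWords("john_doe - boy")));         // prints [john_doe, boy]
    }
}
```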

And of course, things could be made more efficient by using Guava's Splitter class.

Sean Patrick Floyd answered Oct 15 '22 19:10



Try this:

// replaceAll with a character class turns each listed symbol into a space
input.replaceAll("[-&+$?.@]", " ").trim().split("\\s+");
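One caveat with this approach: String.replace treats its first argument as a literal sequence of characters, not as a set, so replacing individual symbols needs replaceAll with a character class. Here is a minimal sketch of the difference; the class name ReplaceThenSplitDemo and the helper splitTerms are hypothetical.

```java
import java.util.Arrays;

public class ReplaceThenSplitDemo {
    // Hypothetical helper: turn each listed symbol into a space, then split on whitespace runs.
    public static String[] splitTerms(String input) {
        return input.replaceAll("[-&+$?.@]", " ").trim().split("\\s+");
    }

    public static void main(String[] args) {
        String input = "john - & + $ ? . @ boy";
        // replace() searches for the literal text "-&+$?.@", which never occurs
        // in this input, so the string comes back unchanged.
        System.out.println(input.replace("-&+$?.@", " ").equals(input)); // prints true
        System.out.println(Arrays.toString(splitTerms(input)));          // prints [john, boy]
    }
}
```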

awinas kannan answered Oct 15 '22 20:10



Breaking it down step by step:

In your case, you first remove the non-word characters (as pointed out), while preserving the whitespace so that the String is easy to split afterwards.

String ugly = "john - & + $ ? . @ boy";
String words = ugly.replaceAll("[^\\w\\s]", "");

The resulting String contains runs of several spaces, which you will generally want to collapse to a single space:

String formatted = words.trim().replaceAll(" +", " ");

Now you can easily split the String into an array of words:

String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
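Put together, the three steps can be sketched as one runnable class. The class name StepByStepSplitDemo and the method toTerms are made up for illustration; the patterns are the ones used above.

```java
import java.util.Arrays;

public class StepByStepSplitDemo {
    // Hypothetical helper that chains the three steps.
    public static String[] toTerms(String ugly) {
        // Step 1: drop everything that is neither a word character nor whitespace.
        String words = ugly.replaceAll("[^\\w\\s]", "");
        // Step 2: trim the ends and collapse runs of spaces to a single space.
        String formatted = words.trim().replaceAll(" +", " ");
        // Step 3: split on the remaining single whitespace characters.
        return formatted.split("\\s");
    }

    public static void main(String[] args) {
        String[] terms = toTerms("john - & + $ ? . @ boy");
        System.out.println(Arrays.toString(terms)); // prints [john, boy]
    }
}
```

Step 2 only collapses plain spaces, so an input containing tabs or newlines between words would need "\\s+" there instead of " +".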

StoopidDonut answered Oct 15 '22 19:10
