I am trying to split a string on spaces and some specific special characters.
Given the string "john - & + $ ? . @ boy" I want to get the array:
array[0]="john";
array[1]="boy";
I've tried several regular expressions and gotten nowhere. Here is my current stab:
String[] terms = uglString.split("\\s+|[\\-\\+\\$\\?\\.@&].*");
This preserves "john" but not "boy". Can anyone help me get the rest?
Just use:
String[] terms = input.split("[\\s@&.?$+-]+");
You can put a shorthand character class inside a character class (note the \s), and most metacharacters lose their special meaning inside a character class, except for [, ], -, &, \ (and ^ when it comes first). However, & is only meaningful when it comes in a pair (&&), and - is treated as a literal character if it is placed at the beginning or the end of the character class.
Other languages may have different rules for parsing the pattern, but the rule about - applies to most engines.
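To make the behaviour concrete, here is a minimal runnable sketch of the split above. The class name SplitOnCharClass and the helper splitTerms are only illustrative names, not part of the original answer.

```java
import java.util.Arrays;

public class SplitOnCharClass {
    // Hypothetical helper: splits on runs of whitespace or the listed special characters.
    static String[] splitTerms(String input) {
        // '-' is placed last in the class, so it is a literal rather than a range operator.
        // A single '&' is also literal; only "&&" would mean intersection.
        return input.split("[\\s@&.?$+-]+");
    }

    public static void main(String[] args) {
        String[] terms = splitTerms("john - & + $ ? . @ boy");
        System.out.println(Arrays.toString(terms)); // prints [john, boy]
    }
}
```

The trailing + in the pattern matters: without it, every single separator character would produce its own split and the result would contain empty strings between them.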
As @Sean Patrick Floyd mentioned in his answer, the important thing boils down to defining what constitutes a word. \w in Java is equivalent to [a-zA-Z0-9_] (upper- and lower-case English letters, digits and underscore), and therefore \W consists of all other characters. If you want to consider Unicode letters and digits, you may want to look at Unicode character classes.
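As a rough illustration of the difference, the sketch below splits on anything that is not a Unicode letter (\p{L}), a Unicode number (\p{N}) or an underscore. The choice of these particular categories, and the names UnicodeSplit and splitUnicodeWords, are assumptions for the example; your definition of a "word" may need other categories.

```java
import java.util.Arrays;

public class UnicodeSplit {
    // Hypothetical helper: a "word" here is any run of Unicode letters, numbers or underscores.
    static String[] splitUnicodeWords(String input) {
        return input.split("[^\\p{L}\\p{N}_]+");
    }

    public static void main(String[] args) {
        // Two names containing an accented e and a u with umlaut, written as escapes.
        String input = "jos\u00e9 - & m\u00fcller";

        // With the default ASCII-only \W, the accented letters act as separators.
        System.out.println(Arrays.toString(input.split("\\W+"))); // prints [jos, m, ller]

        // With the Unicode-aware class, both names survive intact.
        System.out.println(splitUnicodeWords(input).length); // prints 2
    }
}
```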
You could make your code much simpler by replacing your pattern with "\\W+" (one or more occurrences of a non-word character). This way you are whitelisting characters instead of blacklisting, which is usually a good idea.
And of course, things could be made more efficient by using Guava's Splitter class.
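Here is a small sketch of the "\\W+" approach using only the JDK. One caveat worth knowing: if the input starts with a non-word character, String.split returns an empty first element, so the hypothetical helper splitWords below filters empty strings out (Guava's Splitter has omitEmptyStrings() for the same purpose).

```java
import java.util.Arrays;

public class SplitOnNonWord {
    // Hypothetical helper: splits on runs of non-word characters and drops empty pieces.
    static String[] splitWords(String input) {
        return Arrays.stream(input.split("\\W+"))
                     .filter(s -> !s.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitWords("john - & + $ ? . @ boy"))); // prints [john, boy]

        // A leading separator yields an empty first element with plain split...
        System.out.println("- john".split("\\W+").length); // prints 2
        // ...which the filter removes.
        System.out.println(splitWords("- john").length); // prints 1
    }
}
```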
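Try this (note that replaceAll takes a regular expression, whereas replace would only look for the literal sequence "-&+$?.@"):
input.replaceAll("[-&+$?.@]", " ").trim().split("\\s+");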
Breaking it down step by step:
For your case, you remove the non-word characters (as pointed out). You will want to keep the whitespace at this stage so that the String is easy to split afterwards.
String ugly = "john - & + $ ? . @ boy";
String words = ugly.replaceAll("[^\\w\\s]", "");
This leaves a lot of consecutive spaces in the resulting String, which you will probably want to collapse into a single space:
String formatted = words.trim().replaceAll(" +", " ");
Now you can easily split the String into a String array of words:
String[] terms = formatted.split("\\s");
System.out.println(terms[0]);
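Put together, the three steps might look like the sketch below. The class name StepByStepSplit and the method extractTerms are illustrative only, and step 2 uses \\s+ instead of " +" so that tabs are collapsed as well, which is a small departure from the snippets above.

```java
import java.util.Arrays;

public class StepByStepSplit {
    // Hypothetical helper combining the three steps described above.
    static String[] extractTerms(String ugly) {
        // Step 1: drop everything that is neither a word character nor whitespace.
        String words = ugly.replaceAll("[^\\w\\s]", "");
        // Step 2: trim the ends and collapse each run of whitespace into one space.
        String formatted = words.trim().replaceAll("\\s+", " ");
        // Step 3: split on the single remaining separator.
        return formatted.split(" ");
    }

    public static void main(String[] args) {
        String[] terms = extractTerms("john - & + $ ? . @ boy");
        System.out.println(Arrays.toString(terms)); // prints [john, boy]
    }
}
```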