
Java (Regex) - Get all words in a sentence

I need to split a Java string into an array of words. Let's say the string is:

"Hi!! I need to split this string, into a serie's of words?!"

So far I have tried String[] strs = str.split("(?!\\w)"), but it keeps symbols such as ! in the array, and it also keeps strings like "Hi!". The string I am splitting will always be lowercase. I would like an array that looks like this: {"hi", "i", "need", "to", "split", "this", "string", "into", "a", "serie's", "of", "words"}. Note that the apostrophe is kept.

How could I change my regex to not include the symbols in the array?

Apologies, to clarify: I would define a word as a sequence of alphanumeric characters only, but including the ' character when it appears inside a word as in "it's", and not when it is used to quote a word as in "'its'". Also, in this context "hi," and "hi-person" are not words, but "hi" and "person" are. I hope that clarifies the question.

crazyfool asked Jan 26 '13 17:01



2 Answers

You can remove the punctuation characters (!, ? and ,) first and then split on whitespace:

str = str.replaceAll("[!?,]", "");
String[] words = str.split("\\s+");

Result:

Hi, I, need, to, split, this, string, into, a, serie's, of, words
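
For anyone who wants to try this, here is a minimal runnable sketch of the strip-then-split idea. The class name StripThenSplit and the helper words are made up for illustration, and the input is lowercase as the question says it always will be. Note that only the three listed characters are removed, so something like "hi-person" stays as one token, which is not what the question's definition of a word asks for.

```java
import java.util.Arrays;

public class StripThenSplit {
    // Hypothetical helper (not from the answer): removes '!', '?' and ','
    // and then splits the remaining text on runs of whitespace.
    static String[] words(String s) {
        String cleaned = s.replaceAll("[!?,]", "").trim();
        // Avoid returning a single empty string for blank input.
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String str = "hi!! i need to split this string, into a serie's of words?!";
        // prints [hi, i, need, to, split, this, string, into, a, serie's, of, words]
        System.out.println(Arrays.toString(words(str)));
        // prints [hi-person] because '-' is not in the character class
        System.out.println(Arrays.toString(words("hi-person")));
    }
}
```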

isvforall answered Oct 17 '22 18:10



This should work for what you want:

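import java.util.Arrays;

// Separator: a run of characters that are neither letters nor apostrophes,
// then any apostrophes, then optional repeats of that same run.
String line = "Hi!! I need to split this string, into a serie's of words?! but not '' or ''' word";
String regex = "([^a-zA-Z']+)'*\\1*";
String[] split = line.split(regex);
System.out.println(Arrays.asList(split));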

Gives

[Hi, I, need, to, split, this, string, into, a, serie's, of, words, but, not, or, word]
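
A different way to get at the same thing, offered only as a sketch and not as either answerer's method, is to match the words directly instead of describing the separators. The pattern below is one possible encoding of the question's definition (alphanumerics, with an apostrophe allowed only between alphanumerics). The names WordExtractor, WORD and words are made up for illustration. Unlike the split regex above, it also treats digits as word characters and cannot produce a leading empty string when the input starts with punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordExtractor {
    // Assumed definition of a word: one or more alphanumerics, optionally
    // followed by groups of "apostrophe + alphanumerics". This keeps
    // "serie's" whole but leaves out the quotes around "'its'".
    private static final Pattern WORD =
            Pattern.compile("[a-zA-Z0-9]+(?:'[a-zA-Z0-9]+)*");

    // Hypothetical helper: collects every match of WORD, in order.
    static List<String> words(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(s);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [hi, i, need, to, split, this, string, into, a, serie's, of, words]
        System.out.println(words("hi!! i need to split this string, into a serie's of words?!"));
        // prints [its, hi, person, it's]
        System.out.println(words("'its' hi-person it's"));
    }
}
```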
Tom Cammann answered Oct 17 '22 16:10
