I'm trying to get all the words in a sentence with regex, but only the ones made up entirely of [a-zA-Z]. So for "I am a boy" I want {"I", "am", "a", "boy"}, but for "I a1m a b*y" I want {"I", "a"}, because "a1m" and "b*y" contain characters other than [a-zA-Z].
So for me to get words, I'm trying to check
So I ended up with something like this in Java:
Pattern p = Pattern.compile("^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());
However, I only get "i " and " good". When "i " is matched, the space after "i" is consumed as part of the match, so the remaining input is "am good". Since "am" is no longer at the beginning of the string and no longer has an unconsumed space before it, it does not get returned.
Can you guys provide any feedback on this? Is there a way to just peek at the next character and not return the space?
Assuming your regex engine supports lookahead/lookbehind assertions, you can use something like the following:
(^|(?<= ))[a-zA-Z]+($|(?= ))
Here's a quick description of what each component does:
(^|(?<= )): This says "if a word starts here, we're interested". Specifically,
    ^: Match the beginning of the line, or
    (?<= ): Match any point that is preceded by a space, without actually consuming the space itself. This is called a positive lookbehind assertion.
[a-zA-Z]+: This should be obvious, but it matches any run of sequential ASCII alphabetic characters.
($|(?= )): This says "if the word is finished here, we're done". Specifically,
    $: Match the end of the line, or
    (?= ): Match any point that is followed by a space, without actually consuming the space itself. This is called a positive lookahead assertion.
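As a rough illustration of the components described above, here is a minimal Java sketch that runs the lookaround pattern against both sentences from the question. The class name LookaroundWords and the helper findWords are made up for this example and are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the pattern.
class LookaroundWords {
    // A run of ASCII letters that starts at the beginning of the input or
    // right after a space, and ends at the end of the input or right
    // before a space. The spaces are only inspected, never consumed.
    private static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?= ))");

    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group()); // whole match: just the letters
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("i am good"));   // [i, am, good]
        System.out.println(findWords("I a1m a b*y")); // [I, a]
    }
}
```

Because the lookbehind and lookahead are zero-width, the space after one word is still available to serve as the "preceded by a space" condition for the next word, which is what the original pattern in the question was missing.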
Note that this particular regex doesn't count a word as a word if it's followed by punctuation. This may actually not be what you want, but you described checking for spaces so that's what the regex does. If you want to support words that are followed by simple punctuation you might amend that last atom to be
($|(?=[ .,!?]))
which will match the word if it's followed by a space, period, comma, exclamation mark, or question mark. You can be more elaborate too if you want.
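A small Java sketch of the punctuation-tolerant variant might look like the following. The class name PunctuationWords, the helper findWords, and the sample sentence are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate the amended pattern.
class PunctuationWords {
    // Same as the space-only pattern, except that a word may also be
    // followed by a period, comma, exclamation mark, or question mark.
    private static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?=[ .,!?]))");

    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "a1m" is still rejected; the punctuation is not part of the match.
        System.out.println(findWords("Hello, world! I a1m ok.")); // [Hello, world, I, ok]
    }
}
```

Note that this variant only relaxes the end of a word. A word must still begin at the start of the input or directly after a space.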
Could you use a simpler pattern like \b[A-Za-z]+\b instead? (The \b metacharacter matches the boundary between word characters (e.g., letters) and nonword characters (e.g., spaces and punctuation).)
The code
Pattern p = Pattern.compile("\\b[A-Za-z]+\\b");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());
produces {"i", "am", "good"}.
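It may be worth running the same pattern on the question's second sentence too. The sketch below does that; the class name BoundaryWords and the helper findWords are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to probe the \b pattern.
class BoundaryWords {
    private static final Pattern WORD = Pattern.compile("\\b[A-Za-z]+\\b");

    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("i am good"));   // [i, am, good]
        // "a1m" is skipped because digits are word characters, so there is
        // no boundary between "a" and "1". "*" is a nonword character,
        // though, so "b" and "y" are each reported as words.
        System.out.println(findWords("I a1m a b*y")); // [I, a, b, y]
    }
}
```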
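Edit: As mathematical.coffee commented, the above fails. For example, on "b*y" it matches "b" and "y" separately, because "*" is a nonword character and so creates a boundary on each side. The expression
(?<=^|\s)[A-Za-z]+(?=\W*(?:\s*$|\s))
may work better. For the string "I a1m a b*y boy am is!! or", matching produces "I", "a", "boy", "am", "is", "or".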
If "is!!" should be ignored instead, the expression
(?<=^|\s)[A-Za-z]+(?=$|\s)
can be used. On the previous example, it does not return "is" but returns the other words ("I", "a", "boy", "am", "or").
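To compare the two whitespace-delimited expressions side by side, a minimal Java sketch could look like this. The class name EditPatterns, the constant names LENIENT and STRICT, and the helper findWords are assumptions introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class comparing the two expressions from the edit.
class EditPatterns {
    // Accepts a word followed by trailing nonword characters (e.g. "is!!"),
    // as long as whitespace or the end of the input comes next.
    static final Pattern LENIENT =
            Pattern.compile("(?<=^|\\s)[A-Za-z]+(?=\\W*(?:\\s*$|\\s))");

    // Accepts a word only if it is directly followed by whitespace
    // or by the end of the input.
    static final Pattern STRICT =
            Pattern.compile("(?<=^|\\s)[A-Za-z]+(?=$|\\s)");

    static List<String> findWords(Pattern pattern, String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = pattern.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String input = "I a1m a b*y boy am is!! or";
        System.out.println(findWords(LENIENT, input)); // [I, a, boy, am, is, or]
        System.out.println(findWords(STRICT, input));  // [I, a, boy, am, or]
    }
}
```

Both patterns use \s rather than a literal space, so tabs and newlines also count as word separators here.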