finding words with [a-zA-Z] from a sentence using Regex

Question

I'm trying to get all the words in a sentence with regex, but only the ones made up entirely of [a-zA-Z]. So for "I am a boy" I want {"I", "am", "a", "boy"}, but for "I a1m a b*y" I want {"I", "a"}, because "a1m" and "b*y" include characters other than [a-zA-Z].

To find the words, I'm checking three cases:

if the word is at the beginning of the string, there must be a space after it
if it's the last word, there must be a space before it
otherwise, there must be a space both before and after the word.

So I ended up with something like this in Java:

Pattern p = Pattern.compile("^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());

However, I only get "i " and " good". When "i " is matched, the space after "i" is consumed as part of the match, so the string left to search is "am good". Since "am" is not at the beginning of the string and no longer has an unconsumed space in front of it, it does not get returned.
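To make the consumed space visible, here is a minimal sketch that prints every match together with its start and end offsets. The class name ConsumedSpaceDemo and the helper describeMatches are made-up names for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedSpaceDemo {
    // Hypothetical helper: returns each match in brackets, followed by its
    // start offset (inclusive) and end offset (exclusive).
    static List<String> describeMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("[" + m.group() + "] " + m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // The pattern from the question, unchanged.
        String regex = "^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+";
        // Prints:
        // [i ] 0-2
        // [ good] 4-9
        describeMatches(regex, "i am good").forEach(System.out::println);
    }
}
```

The first match ends at offset 2, so the next search starts at "am good", where no alternative can match "am".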

Can you guys provide any feedback on this? Is there a way to just peek at the next character and not return the space?

Lily Ballard · Accepted Answer

Assuming your regex engine supports lookahead/lookbehind assertions, you can use something like the following:

(^|(?<= ))[a-zA-Z]+($|(?= ))

Here's a quick description of what each component does:

(^|(?<= )): This says "if a word starts here, we're interested". Specifically,
^: Match the beginning of the line, or
(?<= ): Match any point that is preceded by a space, without actually consuming the space itself. This is called a positive lookbehind assertion.

[a-zA-Z]+: This should be obvious, but it matches any run of sequential ASCII alphabetic characters.

($|(?= )): This says "if the word is finished here, we're done". Specifically,
$: Match the end of the line, or
(?= ): Match any point that is followed by a space, without actually consuming the space itself. This is called a positive lookahead assertion.
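In Java, a pattern built from these three components can be dropped into the loop from the question more or less as-is. The following is a minimal sketch rather than a definitive implementation; the class name LookaroundWords and the helper findWords are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundWords {
    // Word start: beginning of input or preceded by a space (lookbehind).
    // Word end: end of input or followed by a space (lookahead).
    // Neither assertion consumes the space, so adjacent words still match.
    static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?= ))");

    // Hypothetical helper: collects every match of WORD in the sentence.
    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("i am good"));   // [i, am, good]
        System.out.println(findWords("I a1m a b*y")); // [I, a]
    }
}
```

Because the spaces are only asserted, m.group() returns the bare word with no leading or trailing space.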

Note that this particular regex doesn't count a word as a word if it's followed by punctuation. This may actually not be what you want, but you described checking for spaces so that's what the regex does. If you want to support words that are followed by simple punctuation you might amend that last atom to be

($|(?=[ .,!?]))

which will match the word if it's followed by a space, period, comma, exclamation mark, or question mark. You can be more elaborate too if you want.
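As a rough illustration of the punctuation-tolerant variant, here is a small Java sketch. The class name PunctuationWords, the helper findWords, and the sample sentence are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctuationWords {
    // Same start condition as before, but a word may now end before
    // a space, period, comma, exclamation mark, or question mark.
    static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?=[ .,!?]))");

    // Hypothetical helper: collects every match of WORD in the sentence.
    static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "a1m" and "b*y," are still rejected; "really!" now counts.
        System.out.println(findWords("I a1m a b*y, really!")); // [I, a, really]
    }
}
```

Only the trailing condition was relaxed, so a word that directly follows punctuation without a space is still skipped.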

drf · Answer

Could you use a simpler pattern like \b[A-Za-z]+\b instead? (The \b metacharacter matches the boundary between word characters, i.e. letters, digits, and underscore, and nonword characters such as spaces and punctuation.)

The code

Pattern p = Pattern.compile("\\b[A-Za-z]+\\b"); // in a Java string literal, \b must be written as \\b
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());

produces {"i", "am", "good"}.

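Edit: As mathematical.coffee commented, the above fails. (For example, on "b*y" it returns "b" and "y", because \b also matches on either side of the "*".) The expression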

(?<=^|\s)[A-Za-z]+(?=\W*(?:\s*$|\s))

may work better. For the string "I a1m a b*y boy am is!! or", matching produces "I", "a", "boy", "am", "is", "or".

If in the previous expression "is!!" should be ignored, the expression (?<=^|\s)[A-Za-z]+(?=$|\s) can be used instead. In the previous example, it does not return "is" but returns the other words (I, a, boy, am, or).
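The difference between the three patterns in this answer is easiest to see side by side. Below is a minimal Java sketch that runs each of them over the same test string; the class name BoundaryComparison and the helper findAll are hypothetical names introduced for the comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryComparison {
    // Hypothetical helper: returns every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "I a1m a b*y boy am is!! or";

        // Plain word boundaries: "b*y" is split into "b" and "y".
        System.out.println(findAll("\\b[A-Za-z]+\\b", input));
        // [I, a, b, y, boy, am, is, or]

        // Whitespace-delimited, trailing punctuation tolerated.
        System.out.println(findAll("(?<=^|\\s)[A-Za-z]+(?=\\W*(?:\\s*$|\\s))", input));
        // [I, a, boy, am, is, or]

        // Whitespace-delimited, no trailing punctuation allowed.
        System.out.println(findAll("(?<=^|\\s)[A-Za-z]+(?=$|\\s)", input));
        // [I, a, boy, am, or]
    }
}
```

Note that every backslash in the regex is doubled inside the Java string literal.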

finding words with [a-zA-Z] from a sentence using Regex

Tags:

java

regex

Yoonho Frank Jung

2 Answers

Lily Ballard

drf
