
finding words with [a-zA-Z] from a sentence using Regex

Tags: java, regex

I'm trying to get all the words in a sentence with regex, but only the ones made up entirely of [a-zA-Z]. So for "I am a boy" I want {"I", "am", "a", "boy"}, but for "I a1m a b*y" I want {"I", "a"}, because "a1m" and "b*y" include characters other than [a-zA-Z].

To find the words, I'm trying to check:

  1. if the word is at the beginning of the string, only whether there's a space after it;
  2. otherwise, whether there's a space both before and after it;
  3. if it's the last word, only whether there's a space before it.

So I ended up with something like this in Java:

Pattern p = Pattern.compile("^[a-zA-Z]+ |^[a-zA-Z]+$| [a-zA-Z]+$| [a-zA-Z]+");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());

However, I only get "i " and " good". When "i " is matched, it takes the space after "i" with it, so the string left to search is "am good". Since "am" is neither at the beginning of the string nor preceded by a space, it is not returned.

Can you guys provide any feedback on this? Is there a way to just peek at the next character and not return the space?

Yoonho Frank Jung asked Feb 22 '23 20:02


2 Answers

Assuming your regex engine supports lookahead/lookbehind assertions, you can use something like the following:

(^|(?<= ))[a-zA-Z]+($|(?= ))

Here's a quick description of what each component does:

(^|(?<= )): This says "if a word starts here, we're interested". Specifically,
  ^: Match the beginning of the line, or
  (?<= ): Match any point that is preceded by a space, without actually consuming the space itself. This is called a positive lookbehind assertion.

[a-zA-Z]+: This should be obvious, but it matches any run of sequential ASCII alphabetic characters.

($|(?= )): This says "if the word is finished here, we're done". Specifically,
  $: Match the end of the line, or
  (?= ): Match any point that is followed by a space, without actually consuming the space itself. This is called a positive lookahead assertion.
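
As a rough illustration of the pieces described above, here is a minimal Java sketch that runs the pattern, written with balanced parentheses as (^|(?<= ))[a-zA-Z]+($|(?= )), against the two sentences from the question. The class name LookaroundWords and the helper findWords are hypothetical names chosen for this example; they are not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundWords {
    // A run of ASCII letters that starts at the beginning of the input or
    // right after a space, and ends at the end of the input or right before a space.
    private static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?= ))");

    // Hypothetical helper: collects every match of WORD in the sentence.
    public static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            // The lookarounds consume nothing, so the match holds only letters.
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords("i am good"));   // prints [i, am, good]
        System.out.println(findWords("I a1m a b*y")); // prints [I, a]
    }
}
```

Because the lookarounds only test for the neighbouring space without consuming it, m.group() returns the bare word, and the next search can still see the space in front of the following word.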


Note that this particular regex doesn't count a word as a word if it's followed by punctuation. This may actually not be what you want, but you described checking for spaces so that's what the regex does. If you want to support words that are followed by simple punctuation you might amend that last atom to be

($|(?=[ .,!?]))

which will match the word if it's followed by a space, period, comma, exclamation mark, or question mark. You can be more elaborate too if you want.
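
A small Java sketch of this punctuation-tolerant variant might look like the following. The sample sentence, the class name PunctuationWords, and the helper findWords are assumptions made for illustration only. Note that the leading atom is unchanged, so a word still has to start at the beginning of the input or right after a space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationWords {
    // Same start condition as before; the end condition now also accepts
    // a following period, comma, exclamation mark, or question mark.
    private static final Pattern WORD =
            Pattern.compile("(^|(?<= ))[a-zA-Z]+($|(?=[ .,!?]))");

    // Hypothetical helper: collects every match of WORD in the sentence.
    public static List<String> findWords(String sentence) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(sentence);
        while (m.find()) {
            words.add(m.group()); // the punctuation is only peeked at, not returned
        }
        return words;
    }

    public static void main(String[] args) {
        // "a1m" is still rejected because of the digit inside it.
        System.out.println(findWords("Hello, world! I a1m fine.")); // prints [Hello, world, I, fine]
    }
}
```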

Lily Ballard answered Mar 03 '23 15:03


Could you use a simpler pattern like \b[A-Za-z]+\b instead? (The \b metacharacter matches the boundary between a word character (a letter, digit, or underscore) and a nonword character (e.g., a space or punctuation mark).)

The code

Pattern p = Pattern.compile("\\b[A-Za-z]+\\b");
Matcher m = p.matcher("i am good");
while(m.find()) System.out.println(m.group());

produces {"i", "am", "good"}.

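Edit: As mathematical.coffee commented, the above fails for the asker's second example: on "I a1m a b*y" it also returns "b" and "y", because \b sees a word boundary on either side of the "*". The expression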

(?<=^|\s)[A-Za-z]+(?=\W*(?:\s*$|\s))

may work better. For the string "I a1m a b*y boy am is!! or", matching produces "I", "a", "boy", "am", "is", "or".

If "is!!" should be ignored as well, the expression (?<=^|\s)[A-Za-z]+(?=$|\s) can be used instead. For the same example string, it does not return "is" but returns the other words (I, a, boy, am, or).
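
To make the difference between the three expressions in this answer concrete, here is a short Java sketch that runs each of them over the same example string. The class name BoundaryComparison and the helper findAll are hypothetical names used only for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryComparison {
    // Hypothetical helper: returns every match of the regex in the input.
    public static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String s = "I a1m a b*y boy am is!! or";

        // \b splits "b*y" into "b" and "y" because "*" is a nonword character.
        System.out.println(findAll("\\b[A-Za-z]+\\b", s));
        // prints [I, a, b, y, boy, am, is, or]

        // Whitespace-anchored start; trailing punctuation before whitespace is tolerated.
        System.out.println(findAll("(?<=^|\\s)[A-Za-z]+(?=\\W*(?:\\s*$|\\s))", s));
        // prints [I, a, boy, am, is, or]

        // Strict version: the word must be followed by whitespace or the end of input.
        System.out.println(findAll("(?<=^|\\s)[A-Za-z]+(?=$|\\s)", s));
        // prints [I, a, boy, am, or]
    }
}
```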

drf answered Mar 03 '23 16:03