 

Java regex replace all not replacing all words

Tags: java, regex

I've been playing with this regex in Java for ages and can't get it to work:

(?:^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?: |$)

The following:

pattern.matcher("the cat in the hat").replaceAll(" ")

gives me "cat the hat". Another example input is "the cat in of the next hat", which gives me "cat of next hat".

Is there any way I can make this regex replacement work without having to break them out into multiple separate regexes for each word and try to replace a string repeatedly?

RTF asked Apr 16 '15 14:04




2 Answers

Yeah, you can do this pretty easily. You just need to use word boundaries, which is what you're trying to describe with (?:^| ) and (?: |$). Just do this instead:

\b(?:the|and|at|in|or|on|off|all|beside|under|over|next)\b

Your original didn't capture, but as mentioned in the comments, if you want to capture the matched word you can use a capturing group instead of a non-capturing one:

\b(the|and|at|in|or|on|off|all|beside|under|over|next)\b
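For reference, here is a minimal sketch of how both patterns could be used from Java. The class and method names (WordBoundaryDemo, strip, bracket) are made up for illustration. Replacing with an empty string and then collapsing the leftover spaces is my own assumption about the desired output, not something the question specifies:

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Non-capturing version: word list from the question, wrapped in \b.
    private static final Pattern STOP_WORDS = Pattern.compile(
            "\\b(?:the|and|at|in|or|on|off|all|beside|under|over|next)\\b");

    // Capturing version: the matched word is available as group 1.
    private static final Pattern STOP_WORDS_CAPTURED = Pattern.compile(
            "\\b(the|and|at|in|or|on|off|all|beside|under|over|next)\\b");

    static String strip(String input) {
        // Remove every stop word, then collapse runs of spaces and trim the ends.
        return STOP_WORDS.matcher(input).replaceAll("")
                .replaceAll(" +", " ")
                .trim();
    }

    static String bracket(String input) {
        // $1 in the replacement refers to the captured word.
        return STOP_WORDS_CAPTURED.matcher(input).replaceAll("[$1]");
    }

    public static void main(String[] args) {
        System.out.println(strip("the cat in the hat"));          // cat hat
        System.out.println(strip("the cat in of the next hat"));  // cat of hat
        System.out.println(bracket("the cat in the hat"));        // [the] cat [in] [the] hat
    }
}
```

Since \b matches without consuming any character, the spaces around each word stay in the string, which is why the cleanup step is there.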
Jonathan Mee answered Oct 08 '22 05:10


The problem with your pattern is that the leading and trailing spaces are included in the matches, and a character cannot be part of two matches.

So with the input the_cat_in_the_hat (the underscores replace the spaces here, to make the explanation clearer):

  1. First match: the_, remaining string: cat_in_the_hat
  2. Second match: _in_, remaining string: the_hat
  3. The next the is not matched: the space before it was already consumed by the second match, so it is neither preceded by a space nor by the beginning of the (original) string.
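If you want to see the steps above for yourself, a small sketch like this lists each match with its start and end offsets. The class name ConsumedSpaceDemo and the helper matches are just illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsumedSpaceDemo {
    // The pattern from the question: the surrounding spaces are part of each match.
    private static final Pattern ORIGINAL = Pattern.compile(
            "(?:^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?: |$)");

    static List<String> matches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = ORIGINAL.matcher(input);
        while (m.find()) {
            // Record the offsets and the matched text; brackets make the spaces visible.
            found.add(m.start() + "-" + m.end() + " [" + m.group() + "]");
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints: [0-4 [the ], 7-11 [ in ]]
        System.out.println(matches("the cat in the hat"));
    }
}
```

The second match ends at offset 11, which is exactly where the second "the" starts, so the search resumes with no space left in front of it.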

You could have used lookarounds instead, since they behave like conditions (i.e. if):

(?<=^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?= |$)


This way, you would have:

  1. First match: the, remaining string: _cat_in_the_hat
  2. Second match: in, remaining string: _the_hat
  3. Third match: the, remaining string: _hat
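A rough Java sketch of the lookaround version, in case it helps. LookaroundDemo and strip are hypothetical names, and replacing with an empty string is my assumption (the question replaces with a space):

```java
import java.util.regex.Pattern;

class LookaroundDemo {
    // Lookbehind and lookahead only check the neighbouring characters,
    // so the spaces are not part of the match.
    private static final Pattern STOP_WORDS = Pattern.compile(
            "(?<=^| )(?:the|and|at|in|or|on|off|all|beside|under|over|next)(?= |$)");

    static String strip(String input) {
        // Only the words themselves are removed; every space is left in place.
        return STOP_WORDS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Spaces shown as underscores, as in the explanation above.
        // Prints: _cat___hat
        System.out.println(strip("the cat in the hat").replace(' ', '_'));
    }
}
```

All three words are matched this time, but the untouched spaces pile up, so you would probably still want to collapse them afterwards.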

But @JonathanMee's answer is the best solution, since word boundaries were implemented precisely for this purpose ;)

sp00m answered Oct 08 '22 06:10