
Running multiple regex patterns on String

Tags: java, regex

Assuming I have a List<String> and an empty List<Pattern>, is this the best way to turn the words in the list into Pattern objects?

for(String word : stringList) {
    patterns.add(Pattern.compile("\\b(" + word + ")\\b"));
}

And then, to run these on a string later:

for(Pattern pattern : patterns) {
    Matcher matcher = pattern.matcher(myString);
    if(matcher.find()) { // find() looks for the word anywhere; matches() would require the whole string to be the word
         myString = matcher.replaceAll("String[$1]");
    }
}

The replaceAll bit is just an example, but $1 would be used most of the time when I use this.

Is there a more efficient way? This feels somewhat clunky. I'm using 80 Strings in the list, by the way, though the Strings are configurable, so there won't always be that many.

This is designed to be something of a swearing filter, so I'll let you guess the words in the List.

An example of input would be "You're a <curse>", and the output would be "You're a *****" for this word. This may not always be the case, though: at some point I may be reading from a HashMap<String, String> where the key is the capture group and the value is the replacement.

Example:

if(hashMap.get(matcher.group(1)) == null) {
    // No backslash needed: only \ and $ are special in a replacement string, * is not.
    myString = matcher.replaceAll("****");
} else {
    myString = matcher.replaceAll(hashMap.get(matcher.group(1)));
}
Connor Spencer Harries asked Mar 18 '23 12:03



1 Answer

You can join these patterns together using alternation with |:

Pattern pattern = Pattern.compile("\\b(" + String.join("|",stringList) + ")\\b");
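
As a rough illustration of the single-pattern approach, the sketch below applies one combined pattern with the question's replaceAll("String[$1]") example, so the input is scanned once instead of once per word. The class name AlternationSketch, the method tag, and the sample words are made up for the demo, and it assumes the words contain no regex metacharacters:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class AlternationSketch {
    // Wraps every listed word found in the input as String[word].
    // Assumes the words are plain text with no regex metacharacters.
    static String tag(String input, List<String> words) {
        Pattern pattern = Pattern.compile("\\b(" + String.join("|", words) + ")\\b");
        // $1 refers to whichever alternative matched inside the group.
        return pattern.matcher(input).replaceAll("String[$1]");
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("foo", "bar");
        System.out.println(tag("foo and bar", words));  // String[foo] and String[bar]
        System.out.println(tag("food and foo", words)); // food and String[foo]
    }
}
```

The second call shows the \b anchors at work: "food" is left alone because the match must end at a word boundary.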

If you cannot use Java 8 and so do not have the String.join method, or if you need to escape the words so that characters in them are not interpreted as regex metacharacters, you will need to build the regex with a manual loop:

StringBuilder regex = new StringBuilder("\\b(");
for (String word : stringList) {
    regex.append(Pattern.quote(word));
    regex.append("|");
}
regex.setLength(regex.length() - 1); // delete last added "|" (assumes stringList is not empty)
regex.append(")\\b");
Pattern pattern = Pattern.compile(regex.toString());
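
If Java 8 is available and the words still need escaping, the quoting and joining can probably be combined in a stream pipeline instead of the manual loop. A minimal sketch, assuming a hypothetical class name PatternBuilderSketch and made-up sample words:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class PatternBuilderSketch {
    // Builds \b(\Qword1\E|\Qword2\E|...)\b from the list.
    // Assumes the list is not empty; an empty list would produce \b()\b,
    // which matches the empty string at word boundaries.
    static Pattern build(List<String> words) {
        String regex = words.stream()
                .map(Pattern::quote) // treat each word as a literal
                .collect(Collectors.joining("|", "\\b(", ")\\b"));
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = build(Arrays.asList("darn", "a.b"));
        System.out.println(p.matcher("well darn it").find()); // true
        System.out.println(p.matcher("axb").find());          // false: the dot is literal
    }
}
```

Without Pattern.quote, the word "a.b" would also match "axb", because an unescaped dot matches any character.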

To use different replacements for the different words, you can apply the pattern with this loop:

Matcher m = pattern.matcher(myString);
StringBuilder out = new StringBuilder();
int pos = 0;
while (m.find()) {
    out.append(myString, pos, m.start());
    String matchedWord = m.group(1);
    String replacement = matchedWord.replaceAll(".", "*");
    out.append(replacement);
    pos = m.end();
}
out.append(myString, pos, myString.length());
myString = out.toString();

You can look up the replacement for the matched word any way you like. The example generates a replacement string of asterisks of the same length as the matched word.
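
For the HashMap<String, String> case mentioned in the question, the lookup could be slotted into the same loop. This is only a sketch under assumptions: the class name ReplacementLookupSketch, the method filter, and the sample words and map entries are invented for illustration, and the map keys are assumed to match the words exactly, including case.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplacementLookupSketch {
    // Replaces each match with its mapped value, or with asterisks
    // of the same length when the map has no entry for the word.
    static String filter(String input, Pattern pattern, Map<String, String> replacements) {
        Matcher m = pattern.matcher(input);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (m.find()) {
            out.append(input, pos, m.start()); // text before the match
            String matchedWord = m.group(1);
            String replacement = replacements.get(matchedWord);
            if (replacement == null) {
                replacement = matchedWord.replaceAll(".", "*");
            }
            out.append(replacement);
            pos = m.end();
        }
        out.append(input, pos, input.length()); // text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(
                "\\b(" + Pattern.quote("darn") + "|" + Pattern.quote("heck") + ")\\b");
        Map<String, String> replacements = new HashMap<>();
        replacements.put("heck", "goodness");
        System.out.println(filter("Well darn, what the heck", pattern, replacements));
        // prints: Well ****, what the goodness
    }
}
```

Because the replacement is appended directly to the StringBuilder rather than passed to replaceAll, any $ or \ in a mapped value is kept as literal text.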

Boann answered Mar 27 '23 21:03
