
generating a regular expression from a string

Tags:

java

regex

I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:

String s = "Page 3 of 23";

If I substitute all digits by \d

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    // For the example above this is equivalent to:
    //     Pattern.compile("Page \\d of \\d\\d");
    Pattern numberPattern = Pattern.compile(sb.toString());

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively some of the metacharacters such as (){}-, etc. will not be escaped. Is there a library to do this or an exhaustive set of characters for regular expressions which I must and must not escape? (I can try to extract them from the Javadocs but am worried about missing something).

Alternatively, is there a library which already does this? (I don't at this stage want to use a full Natural Language Processing solution.)

NOTE: @dasblinkenlight's edited answer now works for me!

peter.murray.rust asked Apr 16 '13 10:04


1 Answer

Java's regexp library provides this functionality:

String s = Pattern.quote(orig);

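The "quoted" string has all of its metacharacters neutralized: Pattern.quote wraps the text in \Q and \E, and everything between those two markers is treated as literal text. First quote your string, and then go through it and replace the digits with \d to make a regular expression. Because the \d has to be read as regex syntax rather than as literal text, each replacement must close the quoting with \E before the \d and reopen it with \Q afterwards.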
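To make the effect of quoting concrete, here is a minimal sketch. The class name QuoteDemo, the helper matchesLiterally and the sample strings are made up for illustration; only Pattern.quote and Pattern.compile come from the JDK. The printed \Q...\E form is what the OpenJDK implementation returns for input that does not itself contain \E.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: true when `candidate` equals `literal` exactly,
    // with every character of `literal` treated as plain text, not regex syntax.
    static boolean matchesLiterally(String literal, String candidate) {
        return Pattern.compile(Pattern.quote(literal)).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String orig = "Fig. (a) {x-y}";
        // The quoted form simply wraps the text in \Q ... \E.
        System.out.println(Pattern.quote(orig)); // \QFig. (a) {x-y}\E
        // Parentheses, braces, the dash and the dot all match only themselves.
        System.out.println(matchesLiterally(orig, "Fig. (a) {x-y}")); // true
        System.out.println(matchesLiterally(orig, "Figs (a) {x-y}")); // false
    }
}
```

The second call returns false because the quoted dot no longer means "any character".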

One thing I would change in your implementation is the replacement algorithm: rather than replacing character-by-character, I would replace digits in groups. This would let an expression produced from Page 3 of 23 match strings like Page 13 of 23 and Page 6 of 8.

String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

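This would produce "\QPage \E\d+\Q of \E\d+\Q\E" no matter what page numbers and counts were there originally. The output needs only one backslash in \d, not two, because the result is fed directly to the regex engine, bypassing the Java compiler. The doubling (and, in the replacement argument, quadrupling) is needed only in Java source, where both the string literal and replaceAll treat the backslash as an escape character.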
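Putting the pieces together, the approach might be wrapped up roughly as follows. This is only a sketch: the class name NumberTemplate, the method fromExample and the "(draft)" suffix in the sample input are illustrative assumptions, not part of the question.

```java
import java.util.regex.Pattern;

class NumberTemplate {
    // Builds a pattern from an example string: literal text stays inside
    // \Q...\E, and each run of digits becomes \d+ outside the quoting.
    static Pattern fromExample(String orig) {
        String regex = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromExample("Page 3 of 23 (draft)");
        System.out.println(p.pattern()); // \QPage \E\d+\Q of \E\d+\Q (draft)\E
        System.out.println(p.matcher("Page 13 of 230 (draft)").matches()); // true
        // The parentheses are literal, so brackets do not match.
        System.out.println(p.matcher("Page 13 of 230 [draft]").matches()); // false
        // A non-digit where a number is expected does not match either.
        System.out.println(p.matcher("Page x of 23 (draft)").matches());   // false
    }
}
```

Note that \d+ accepts a number of any length, so this is deliberately looser than the digit-by-digit version in the question.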

Sergey Kalinichenko answered Sep 20 '22 14:09