Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex performance issues with possible back tracking?

Tags:

java

regex

I have the following input/output and regex code that works fine (for the below input/output).

-- input --

keep this

      keep this too

     Bye
------ Remove Below ------
  remove all of this

-- output --

keep this

      keep this too

     Bye

-- code --

    String text = "keep this\n       \n"
            + "      keep this too\n      \n     Bye\n------ Remove Below ------\n  remove all of this\n";
    System.out.println(text);
    Pattern PATTERN = Pattern.compile("^(.*?)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$",
             Pattern.DOTALL);
    Matcher m = PATTERN.matcher(text);
    if (m.find()) {
        // remove everything as expected (from about input->regex->output)
        text =  ((m.group(1)).replaceAll("[\n]+$", "")).replaceAll("\\s+$", "");
        System.out.println(m.group(1));
        System.out.println(text);
    }

Ok, so this works great. However, this is for a test with the defined input output. When I get large files that I have to parse that contain the following sequence of characters/patterns I'm seeing the code take a while to execute (4-5sec) per the Find() method on files that are say 100k in size that have the following pattern. In fact sometimes I'm not sure if it's returning or not...when I step though as a debug test the find() method hangs and my client disconnects.

NOTE: There is nothing to match in this file...but this is a pattern that is taxing my regex.

-- 100k file --

junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses) 


this repeats from above to make up the 100k file.

-- ASK --

How can I optimize the above regex to handle large file patterns from above as such or is this normal for regex parse speed (4-6sec) let along hanging altogether?

like image 359
genxgeek Avatar asked Feb 24 '26 10:02

genxgeek


2 Answers

Since you're only interested in text above ------ Remove Below ------ line, you don't need to match everything. Just match what you want by shortening your regex and avoid excessive matching and backtracking.

Pattern PATTERN = Pattern.compile("^(.*?)-+ *Remove Below *-+", Pattern.DOTALL);
like image 90
anubhava Avatar answered Feb 26 '26 23:02

anubhava


You are absolutely right, this is a traceback nightmare!

Avoid possible matches when using wildcards. Some tactics, that might help:

if the number of '-' is known use a concrete string to test:

^(.*?)(------ Remove Below ------)(.*)$

or at least a little more concrete

^(.*?)-*-\s*Remove Below\s*--*(.*?)$

be more precise:

^(.*?)(-+)([^-]*)Remove Below([^-]*)(-+)(.*?)$

be greedy if you can:

^(.*)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$

do not include in matches if not needed:

^(.*?)-+.*?Remove Below.*?-+.*?$

of course, depending on your input quality you can combine these concepts:

^(.*)------ Remove Below ------.*$

In your case, parse line by line and when it matches ^.*-+\s*Remove Below\s*-+.*$ stop amending

like image 22
Jan Avatar answered Feb 26 '26 22:02

Jan