I have the following input/output and regex code that works fine (for the below input/output).
-- input --
keep this
keep this too
Bye
------ Remove Below ------
remove all of this
-- output --
keep this
keep this too
Bye
-- code --
String text = "keep this\n \n"
+ " keep this too\n \n Bye\n------ Remove Below ------\n remove all of this\n";
System.out.println(text);
Pattern PATTERN = Pattern.compile("^(.*?)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$",
Pattern.DOTALL);
Matcher m = PATTERN.matcher(text);
if (m.find()) {
// remove everything as expected (from about input->regex->output)
text = ((m.group(1)).replaceAll("[\n]+$", "")).replaceAll("\\s+$", "");
System.out.println(m.group(1));
System.out.println(text);
}
Ok, so this works great. However, this is for a test with the defined input output. When I get large files that I have to parse that contain the following sequence of characters/patterns I'm seeing the code take a while to execute (4-5sec) per the Find() method on files that are say 100k in size that have the following pattern. In fact sometimes I'm not sure if it's returning or not...when I step though as a debug test the find() method hangs and my client disconnects.
NOTE: There is nothing to match in this file...but this is a pattern that is taxing my regex.
-- 100k file --
junk here
more junk here
o o o (even more junk per the ellipses)
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses)
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses)
-------------------------------------this is junk
junk here
more junk here
o o o (even more junk per the ellipses)
this repeats from above to make up the 100k file.
-- ASK --
How can I optimize the above regex to handle large file patterns from above as such or is this normal for regex parse speed (4-6sec) let along hanging altogether?
Since you're only interested in text above ------ Remove Below ------ line, you don't need to match everything. Just match what you want by shortening your regex and avoid excessive matching and backtracking.
Pattern PATTERN = Pattern.compile("^(.*?)-+ *Remove Below *-+", Pattern.DOTALL);
You are absolutely right, this is a traceback nightmare!
Avoid possible matches when using wildcards. Some tactics, that might help:
if the number of '-' is known use a concrete string to test:
^(.*?)(------ Remove Below ------)(.*)$
or at least a little more concrete
^(.*?)-*-\s*Remove Below\s*--*(.*?)$
be more precise:
^(.*?)(-+)([^-]*)Remove Below([^-]*)(-+)(.*?)$
be greedy if you can:
^(.*)(-+)(.*?)Remove Below(.*?)(-+)(.*?)$
do not include in matches if not needed:
^(.*?)-+.*?Remove Below.*?-+.*?$
of course, depending on your input quality you can combine these concepts:
^(.*)------ Remove Below ------.*$
In your case, parse line by line and when it matches ^.*-+\s*Remove Below\s*-+.*$ stop amending
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With