Node.JS Regex engine fails on large input

Tags: java, python, regex, node.js, v8

The question is a bit complicated, and googling didn't really help. I will try to put in only relevant aspects of it.

I have a large document in approximately the following format:

Sample Input:

ABC is a word from one line of this document. It is followed by
some random line
PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
Here GHI appears in the middle.
This may be yet another line.
VWX is a line
this is the last line

I am trying to remove the section of the text according to the below:

From either of:
- ABC
- DEF
- GHI
To either of (while retaining this word):
- PQR
- STU
- VWX

The words that make up "From" can appear anywhere in a line (Look at GHI). But for removal the entire line needs to be removed. (The entire line containing GHI needs to be removed as in the sample output below)

Sample Output:

PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
VWX is a line
this is the last line

The above example actually seemed easy to me until I ran it against very large input files (49 KB).

What I have tried:

The regular expression I am currently using is (with case insensitive and multiline modifier):

^.*\b(abc|def|ghi)\b(.|\s)*?\b(pqr|stu|vwx)\b

Problem

The above regexp works wonderfully on small text files, but it hangs or crashes the engine on large files. I have tried it against the following:

V8 (Node.js) : Hangs
Rhino : Hangs
Python : Hangs
Java : StackOverflowError (stack trace posted at the end of this question)
IonMonkey (Firefox) : WORKS!

Actual Input:

My original Input: http://ideone.com/W4sZmB

My regular expression (split across multiple lines for clarity):

^.*\\b(patient demographics|electronically signed|md|rn|mspt|crnp|rt)\\b
 (.|\\s)*?
 \\b(history of present illness|hpi|chief complaint|cc|reason for consult|patientis|inpatient is|inpatientpatient|pt is|pts are|start end frequency user)\\b

Question:

Is my regular expression correct? Can it be optimized further to avoid this problem?
If it is correct, why do the other engines hang indefinitely? A section of the stack trace is below:

Stack Trace:

Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4218)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
    at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)

PS: I'm adding several tags to this question since I have tried it on those environments and the experiment failed.


asked May 16 '13 06:05

UltraInstinct

1 Answer

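The problem is (.|\s)*: any whitespace character other than a newline matches both . and \s, so the engine can take either alternative at every such character. When the rest of the pattern fails to match, the engine backtracks through every combination of those choices, and the number of combinations grows exponentially with the number of whitespace characters.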

You can see the issue with this regex in Ruby:

str = "b" + "a" * 200 + "cbab"

/b(a|a)*b/.match str

which takes forever, while a basically identical one

/ba*b/.match str

matches quickly.
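
The same effect can be reproduced with the alternation from the question. The following is a minimal Python sketch, not a benchmark: the helper name time_failed_search and the test strings are made up for illustration, and [\s\S] is used only as an unambiguous comparison pattern. Each input contains the start word but never the end word, so the search has to fail after exhausting its options.

```python
import re
import time

# Ambiguous: a space matches both "." and "\s", so every space is a choice point.
AMBIGUOUS = re.compile(r"abc(.|\s)*?pqr")
# Unambiguous comparison: one character class, one way to match each character.
UNAMBIGUOUS = re.compile(r"abc[\s\S]*?pqr")


def time_failed_search(pattern, spaces):
    """Search a string that has the start word but no end word; return (match, seconds)."""
    text = "abc" + " " * spaces
    started = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - started


if __name__ == "__main__":
    # Keep the counts small: the ambiguous pattern slows down sharply as they grow.
    for spaces in (12, 14, 16, 18):
        _, slow = time_failed_search(AMBIGUOUS, spaces)
        _, fast = time_failed_search(UNAMBIGUOUS, spaces)
        print(f"{spaces:2d} spaces  ambiguous: {slow:.4f}s  unambiguous: {fast:.6f}s")
```

On a backtracking engine such as Python's re, the ambiguous timings should climb steeply with each row while the unambiguous ones stay close to flat. Exact numbers depend on the machine and the engine version.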

You can fix this by making the alternatives mutually exclusive: use just .* if . matches newlines in your engine (for example, with a dot-all flag), or (.|\n)* if it doesn't.
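
As a rough sketch of how such a fix could look on the sample input from the question, here is a Python version. It makes two substitutions that are assumptions rather than part of the answer above: it uses the character class [\s\S] in place of (.|\n), since a class also matches any character including newlines but leaves no alternation to backtrack through, and it wraps the "To" words in a lookahead so they are kept in the output. The name remove_sections is hypothetical.

```python
import re

# Assumption: [\s\S] stands in for (.|\n); it matches any character,
# newlines included, with only one way to match each character.
SECTION = re.compile(
    r"^.*\b(?:abc|def|ghi)\b"      # from the start of a line through a "From" word
    r"[\s\S]*?"                    # shortest possible run of any characters
    r"(?=\b(?:pqr|stu|vwx)\b)",    # stop just before a "To" word, without consuming it
    re.IGNORECASE | re.MULTILINE,
)


def remove_sections(text):
    """Delete every span from a line containing a "From" word up to the next "To" word."""
    return SECTION.sub("", text)


SAMPLE = "\n".join([
    "ABC is a word from one line of this document. It is followed by",
    "some random line",
    "PQR which happens to be another word.",
    "This is just another line",
    "I have to fix my regular expression.",
    "Here GHI appears in the middle.",
    "This may be yet another line.",
    "VWX is a line",
    "this is the last line",
])

print(remove_sections(SAMPLE))
```

Because the lookahead does not consume the "To" word, no capture group or replacement text is needed to put it back.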

answered Oct 06 '22 15:10

user71404
