The question is a bit complicated, and googling didn't really help. I will try to put in only relevant aspects of it.
I have a large document in approximately the following format:
Sample Input:
ABC is a word from one line of this document. It is followed by
some random line
PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
Here GHI appears in the middle.
This may be yet another line.
VWX is a line
this is the last line
I am trying to remove the section of the text according to the below:
The words that make up "From" can appear anywhere in a line (Look at GHI). But for removal the entire line needs to be removed. (The entire line containing GHI needs to be removed as in the sample output below)
Sample Output:
PQR which happens to be another word.
This is just another line
I have to fix my regular expression.
VWX is a line
this is the last line
The above example actually seemed easy for me until I ran it against very large input files ( 49KB)
What I have tried:
The regular expression I am currently using is (with case insensitive and multiline modifier):
^.*\b(abc|def|ghi)\b(.|\s)*?\b(pqr|stu|vwx)\b
Problem
The above regexp works wonderfully on small text files. But fails/crashes the engine on large files. I have tried it against the below:
StackoverflowError
(Stack trace posted at the end of this question)Actual Input:
My regular expression (split across multiple lines for clarity):
^.*\\b(patient demographics|electronically signed|md|rn|mspt|crnp|rt)\\b
(.|\\s)*?
\\b(history of present illness|hpi|chief complaint|cc|reason for consult|patientis|inpatient is|inpatientpatient|pt is|pts are|start end frequency user)\\b
Question:
Stack Trace:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4218)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4168)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4357)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4227)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4078)
PS: I'm adding several tags to this question since I have tried it on those environments and the experiment failed.
The reason the regex is so slow is that the "*" quantifier is greedy by default, and so the first ". *" tries to match the whole string, and after that begins to backtrack character by character. The runtime is exponential in the count of numbers on a line.
Good regular expressions are often longer than bad regular expressions because they make use of specific characters/character classes and have more structure. This causes good regular expressions to run faster as they predict their input more accurately.
Backtracking occurs when a regular expression pattern contains optional quantifiers or alternation constructs, and the regular expression engine returns to a previous saved state to continue its search for a match.
Multiple clients make multiple requests to the NodeJS server. NodeJS receives these requests and places them into the EventQueue . NodeJS server has an internal component referred to as the EventLoop which is an infinite loop that receives requests and processes them.
The problem is the (.|\s)* because any space character will match both and it will allow it to go down both options. This makes it get exponentially larger.
You can see the issue with this regex in ruby
str = "b" + "a" * 200 + "cbab"
/b(a|a)*b/.match str
which takes forever, while a basically identical one
/ba*b/.match str
matches quickly.
You can fix this by either using just .*
or if .
doesn't match newlines (.|\n)*
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With