Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bug in Mathematica: regular expression applied to very long string

In the following code, if the string s is appended to be something like 10 or 20 thousand characters, the Mathematica kernel seg faults.

s = "This is the first line.
MAGIC_STRING
Everything after this line should get removed.
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890123456789012345678901234567890
...";

s = StringReplace[s, RegularExpression@"(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*"->""]

I think this is primarily Mathematica's fault and I've submitted a bug report and will follow up here if I get a response. But I'm also wondering if I'm doing this in a stupid/inefficient way. And even if not, ideas for working around Mathematica's bug would be appreciated.

like image 785
dreeves Avatar asked Feb 13 '10 14:02

dreeves


3 Answers

Mathematica uses PCRE syntax, so it does have the /s aka DOTALL aka Singleline modifier, you just prepend the (?s) modifier before the part of the expression in which you want it to apply.

See the RegularExpression documentation here: (expand the section labeled "More Information")
http://reference.wolfram.com/mathematica/ref/RegularExpression.html

The following set options for all regular expression elements that follow them:
(?i) treat uppercase and lowercase as equivalent (ignore case)
(?m) make ^ and $ match start and end of lines (multiline mode)
(?s) allow . to match newline
(?-c) unset options

This modified input doesn't crash Mathematica 7.0.1 for me (the original did), using a string that is 15,000 characters long, producing the same output as your expression:

s = StringReplace[s,RegularExpression@".*MAGIC_STRING(?s).*"->""]

It should also be a bit faster for the reasons @AlanMoore explained

like image 51
Michael Pilat Avatar answered Nov 17 '22 20:11

Michael Pilat


The best way to optimize the regex depends on the internals of Mathematica's regex engine, but I would definitely get rid of the (.|\\n)*, as @Simon mentioned. It's not just the alternation--although it's almost always a mistake to have an alternation in which every alternative matches exactly one character; that's what character classes are for. But you're also capturing each character when you match it (because of the parentheses), only to throw it away when you match the next character.

A quick scan of the Mathematica regex docs doesn't turn up anything like the /s (Singleline or DOTALL) modifier, so I recommend the old JavaScript standby, [\\s\\S]* -- match anything that is whitespace or anything that isn't whitespace. Also, it might help to add the $ anchor to the end of the regex:

"(^|\\n)[^\\n]*MAGIC_STRING[\\s\\S]*$"

But your best option would probably be not to use regexes at all. I don't see anything here that requires them, and it would probably be much easier as well as more efficient to use Mathematica's normal string-manipulation functions.

like image 42
Alan Moore Avatar answered Nov 17 '22 20:11

Alan Moore


Mathematica is a great executive toy but I'd advise against trying to do anything serious with it like regexs over long strings or any kind of computation over significant amounts of data (or where correctness is important). Use something tried and tested. Visual F# 2010 takes 5 milliseconds and one line of code to get the correct answer without crashing:

> let str =
    "This is the first line.\nMAGIC_STRING\nEverything after this line should get removed." +
      String.replicate 2000 "0123456789";;
val str : string =
  "This is the first line.
MAGIC_STRING
Everything after this li"+[20022 chars]

> open System.Text.RegularExpressions;;
> #time;;
--> Timing now on

> (Regex "(^|\\n)[^\\n]*MAGIC_STRING(.|\\n)*").Replace(str, "");;
Real: 00:00:00.005, CPU: 00:00:00.015, GC gen0: 0, gen1: 0, gen2: 0
val it : string = "This is the first line."
like image 2
J D Avatar answered Nov 17 '22 20:11

J D