Painfully slow regular expression

Tags: c#, regex, parsing

I'm trying to parse a formatted email that looks something like this:

From: Mr. Bob Simon Jones
Email: [email protected]
Comments: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris hendrerit, nibh a  tristique malesuada, tellus nibh pharetra mauris, id tincidunt lacus turpis vel risus. 
Vestibulum laoreet venenatis mauris sit amet suscipit. Cras vel pharetra nisl. Suspendisse venenatis ante quis tellus luctus id ornare sem pretium. Cras sodales tristique mauris sagittis ullamcorper. 
Ut sit amet urna magna. Nullam et odio sit amet mauris tempus egestas. Donec eget risus nec lectus adipiscing convallis. Pellentesque in velit enim. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Morbi quis ante diam. Etiam rhoncus leo vulputate ligula luctus volutpat. Praesent luctus, justo eget auctor viverra, diam turpis fringilla elit, non commodo massa arcu et eros. Cras elementum faucibus elit, sollicitudin luctus mi dictum a.
Address: First line, Second Line, Third line pe2 8pd, Fourth Line
Date of Visit: 25/06/2011

I've got a regular expression that works if that's the only text present, but when there's a load of junk text after it, it becomes extremely slow. When running in a .NET app, it doesn't seem to complete at all.

The regular expression is:

.*From: (?<title>Mrs\.|Mr\.|Miss\.|Ms\.) (?<firstName>(\w| )*)(?<=. )(?<surname>(\w| )*)\s*
Email: (?<email>.*)\s*
Comments: (?<comments>(.|\s)*)\s*
Address: (?<address1>[^,]*), (?<address2>[^,]*), (?<address3>[^,]*),(?<address4>.*)\s*
Date of Visit: (?<dateOfVisit>\d\d/\d\d/\d\d\d\d).*

The first line of the expression captures every name except the final one as the first name, and the final name as the surname.

I assume it's probably got something to do with catastrophic backtracking: http://www.regular-expressions.info/catastrophic.html

But I can't quite figure it out. Could anyone point me in the right direction?

Thanks for your time

asked Feb 23 '23 18:02 by JeremyBeadle


1 Answer

Yikes. You're trying to do way too much at once. Break it up into smaller pieces:

  1. First, get the raw value of each field. For example, everything that falls between From: and Email: is the name. Don't try to be clever - be blind. The contents of name aren't important yet - just the blob.

  2. Treat each value separately and process it independently as a distinct value with its own rules. Some might be dates, some might be names with a title, etc. You can write a small, simpler regex to suss out this more particular data into a format that makes sense.
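The two steps above could be sketched in C# roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `VisitEmailParser` and its method names are hypothetical, and it assumes each label appears exactly once, in the order shown in the question, and that no label text (such as `Address:`) occurs inside the comments.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper; the names here are illustrative only.
public static class VisitEmailParser
{
    // Assumption: labels occur once each, in this order.
    private static readonly string[] Labels =
        { "From", "Email", "Comments", "Address", "Date of Visit" };

    // Small anchored pattern for one field only, so backtracking stays bounded.
    private static readonly Regex NamePattern = new Regex(
        @"^(?<title>Mrs|Mr|Miss|Ms)\.\s+(?<firstName>.+)\s+(?<surname>\S+)$");

    // Step 1: grab the raw blob between consecutive labels, without any regex.
    public static Dictionary<string, string> SplitFields(string text)
    {
        var starts = new int[Labels.Length];
        int searchFrom = 0;
        for (int i = 0; i < Labels.Length; i++)
        {
            string marker = Labels[i] + ":";
            int at = text.IndexOf(marker, searchFrom, StringComparison.Ordinal);
            if (at < 0) throw new FormatException("Missing field: " + Labels[i]);
            starts[i] = at;
            searchFrom = at + marker.Length;
        }

        var fields = new Dictionary<string, string>();
        for (int i = 0; i < Labels.Length; i++)
        {
            int valueStart = starts[i] + Labels[i].Length + 1;
            int valueEnd;
            if (i + 1 < Labels.Length)
            {
                // The value runs up to the next label.
                valueEnd = starts[i + 1];
            }
            else
            {
                // The last field ends at its line break, so trailing junk is ignored.
                valueEnd = text.IndexOf('\n', valueStart);
                if (valueEnd < 0) valueEnd = text.Length;
            }
            fields[Labels[i]] = text.Substring(valueStart, valueEnd - valueStart).Trim();
        }
        return fields;
    }

    // Step 2: each field gets its own small rule.
    public static Match ParseName(string rawName)
    {
        return NamePattern.Match(rawName);
    }

    public static string[] ParseAddress(string rawAddress)
    {
        string[] parts = rawAddress.Split(',');
        for (int i = 0; i < parts.Length; i++) parts[i] = parts[i].Trim();
        return parts;
    }

    public static DateTime ParseDate(string rawDate)
    {
        return DateTime.ParseExact(rawDate, "dd/MM/yyyy", CultureInfo.InvariantCulture);
    }
}
```

Calling `SplitFields` on the whole message and then passing `fields["From"]` to `ParseName`, `fields["Address"]` to `ParseAddress`, and `fields["Date of Visit"]` to `ParseDate` keeps every pattern short and anchored, so junk text before or after the message never reaches a regex at all. A name with no separate surname would not match `NamePattern` as written; that case would need its own rule.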

answered Mar 08 '23 11:03 by Rex M
