I have a very long regular expression, built from around 5000 or more phrases.
The text I run the regex against is also large, around 5 KB.
Because both the regex and the input text are large, a match takes at least 2 minutes, which is not acceptable in my project.
So I would like to know how I can optimize this. One option I can think of is to split the regex and use multiple threads to reduce the execution time. Is this the right approach, or is there a better way?
Part of my regex looks like this:
(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private limited|Aurigo Software Technologies 
private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|C.D.I Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)
I am using C# for this stuff.
Please enlighten!
In C++, the current std::regex design and implementation are slow, mostly because the regular expression pattern is parsed and compiled at runtime. Users often don't need a runtime parser, because in many common use cases the pattern is already known at compile time.
In .NET, Regex has an interpreted mode and a compiled mode (RegexOptions.Compiled). The compiled mode takes longer to start, but is generally faster.
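For the C# case in the question, a minimal sketch of the compiled mode could look like the following. The class name CompanyMatcher and the three-phrase list are made up for illustration and stand in for the real list of ~5000 phrases. The regex is built once in a static field and reused, so the extra construction cost is paid a single time.

```csharp
using System;
using System.Text.RegularExpressions;

static class CompanyMatcher
{
    // Hypothetical, shortened phrase list standing in for the real one.
    private static readonly string[] Phrases = { "Accenture", "Adobe Systems", "Amdocs" };

    // Built once and reused. RegexOptions.Compiled makes construction slower
    // but usually makes matching faster afterwards.
    // Regex.Escape keeps characters such as '.' in a phrase literal.
    private static readonly Regex Pattern = new Regex(
        "(" + string.Join("|", Array.ConvertAll(Phrases, Regex.Escape)) + ")",
        RegexOptions.Compiled);

    // Counts how many phrase occurrences the pattern finds in the text.
    public static int CountMatches(string text) => Pattern.Matches(text).Count;

    static void Main()
    {
        // Two of the three phrases occur in this sample sentence.
        Console.WriteLine(CountMatches("Worked at Accenture, then Adobe Systems."));
    }
}
```

Whether this helps enough with 5000 alternatives is something you would have to measure on your own data.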
Being more specific with your regular expressions, even if they become much longer, can make a world of difference in performance. The fewer characters you scan to determine the match, the faster your regexes will be.
String operations are usually faster than regular expression operations, unless, of course, you write the string operations inefficiently. A regular expression has to be parsed first, and code is then generated to perform the match using string operations.
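As a rough sketch of the string-operation route in C#, each phrase can be searched for with an ordinal String.IndexOf instead of going through one giant alternation. The names PhraseScanner and FindPhrases are invented for this example, and the phrase list is a shortened stand-in. Note that this is a plain substring search, so unlike a regex with \b it will also report a phrase that appears inside a longer word.

```csharp
using System;
using System.Collections.Generic;

static class PhraseScanner
{
    // Returns every phrase that occurs somewhere in the text, using
    // ordinal (case-sensitive, culture-independent) substring search.
    public static List<string> FindPhrases(string text, IEnumerable<string> phrases)
    {
        var found = new List<string>();
        foreach (string phrase in phrases)
        {
            if (text.IndexOf(phrase, StringComparison.Ordinal) >= 0)
                found.Add(phrase);
        }
        return found;
    }

    static void Main()
    {
        // Hypothetical, shortened phrase list.
        string[] phrases = { "ADP", "Adobe Systems", "Amdocs" };
        List<string> hits = FindPhrases("Previously at Adobe Systems and Amdocs.", phrases);
        Console.WriteLine(string.Join(", ", hits));
    }
}
```

With about 5000 phrases and a 5 KB text this is 5000 short scans, which is worth timing against the regex before deciding.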
You can greatly improve the performance of this regex by prepending \b at the beginning:
\b(ACS| ... |Z)
With this, the engine only tries the alternation at word boundaries instead of at every character position in the text.
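In C#, the \b prefix could be applied when the pattern is generated from the phrase list, roughly as below. BoundedMatcher and Build are hypothetical names. The longest-first ordering is an extra assumption that is not part of the suggestion above: .NET tries alternatives from left to right, so without it a short phrase like "ADP" would win over "ADP India private limited".

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class BoundedMatcher
{
    // Builds \b(phrase1|phrase2|...) from a list of literal phrases.
    public static Regex Build(string[] phrases)
    {
        // Longest first (an added assumption), each phrase escaped so it is
        // matched literally.
        var alternatives = phrases
            .OrderByDescending(p => p.Length)
            .Select(p => Regex.Escape(p));

        // The leading \b means a match can only start at a word boundary.
        return new Regex(@"\b(" + string.Join("|", alternatives) + ")",
                         RegexOptions.Compiled);
    }

    static void Main()
    {
        // Hypothetical, shortened phrase list.
        Regex regex = Build(new[] { "ADP", "ADP India private limited" });
        Match m = regex.Match("Joined ADP India private limited in 2009");
        Console.WriteLine(m.Value);
    }
}
```

A trailing \b could be added in the same way if matches should also end on a word boundary.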