I face a challenge to match the input in the following format:
The regular expression should match the given key=value part of the sequence and should not break for long inputs (e.g. value is 10000 characters).
First I came to this solution:
/(\w+)=(\d+|"(?:""|[^"])+"(?!"))
and it performs not bad, however it fails in Java6 with StackOverflowError for long inputs (cashes regexplanet for example). I tried to improve it a bit to run faster:
/(\w+)=(\d+|"(?:""|[^"]+)+"(?!"))
but then if input is not matching, it enters endless loop in backtracking trying to match it.
Then I came to this regex:
/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!"))
which is performing slower, but it seems to solve the task.
Can anyone suggest a better / faster regex?
Sample input:
/mol_type="protein" /transl_table=11 /note="[CDS] (""multi
line)" nn /organism="""Some"" Sequence" nn /organism="Some ""Sequence"""
/translation="MHPSSSRIPHIAVVGVSAIFPGSLDAHGFWRDILSGTDLITDVPSTHWLVE
DYYDPDPSAPDKTYAKRGAFLKDVPFDPLEWGVPPSIVPATDTTQLLALIVAKRVLEDAAQGQFE
SMSRERMSVILGVTSAQELLASMVSRIQRPVWAKALRDLGYPEDEVKRACDKIAGNYVPWQESSF
PGLLGNVVAGRIANRLDLGGTNCVTDAACASSLSAMSMAINELALGQSDLVIAGGCDTMNDAFMY
MCFSKTPALSKSGDCRPFSDKADGTLLGEGIAMVALKRLDDAERDGDRVYAVIRGIGSSSDGRSK
SVYAPVPEGQAKALRRTYAAAGYGPETVELMEAHGTGTKAGDAAEFEGLRAMFDESGREDRQWCA
LGSVKSQIGHTKAAAGAAGLFKAIMALHHKVLPPTIKVDKPNPKLDIEKTAFYLNTQARPWIRPG
DHPRRASVSSFGFGGSNFHVALEEYTGPAPKAWRVRALPAELFLLSADTPAALADRARALAKEAE
VPEILRFLARESVLSFDASRPARLGLCATDEADLRKKLEQVAAHLEARPEQALSAPLVHCASGEA
PGRVAFLFPGQGSQYVGMGADALMTFDPARAAWDAAAGVAIADAPLHEVVFPRPVFSDEDRAAQE
ARLRETRWAQPAIGATSLAHLALLAALGVRAEAFAGHSFGEITALHAAGALSAADLLRVARRRGE
LRTLGQVVDHLRASLPAAGPAASASPAAAASVPKASTAAVPAVASVAAPGAAEVERVVMAVVAET
TGYPAEMLGLQMELESDLGIDSIKRVEILSAVRDRTPGLSEVDASALAQLRTLGQVVDHLRASLP
AASAGPAVAAPAAKAPAVAAPTGVSGATPGAAEVERVVMAVVAETTGYPAEMLGLQMELESDLGI
DSIKRVEILSAVRDRTPGLAEVDASALAQLRTLGQVVDHLRASLGPAAVTAGAAPAEPAEEPAST
PLGRWTLVEEPAPAAGLAMPGLFDAGTLVITGHDAIGPALVAALAARGIAAEYAPAVPRGARGAV
FLGGLRELATADAALAVHREAFLAAQAIAAKPALFVTVQDTGGDFGLAGSDRAWVGGLPGLVKTA
ALEWPEASCRAIDLERAGRSDGELAEAIASELLSGGVELEIGLRADGRRTTPRSVRQDAQPGPLP
LGPSDVVVASGGARGVTAATLIALARASHARFALLGRTALEDEPAACRGADGEAALKAALVKAAT
SAGQRVTPAEIGRSVAKILANREVRATLDAIRAAGGEALYVPVDVNDARAVAAALDGVRGALGPV
TAIVHGAGVLADKLVAEKTVEQFERVFSTKVDGLRALLGATAGDPLKAIVLFSSIAARGGNKGQC
DYAMANEVLNKVAAAEAARRPGCRVKSLGWGPWQGGMVNAALEAHFAQLGVPLIPLAAGAKMLLD
ELCDASGDRGARGQGGAPPGAVELVLGAEPKALAAQGHGGRVALAVRADRATHPYLGDHAINGVP
VVPVVIALEWFARAARACRPDLVVTELRDVRVLRGIKLAAYESGGEVFRVDCREVSNGHGAVLAA
ELRGPQGALHYAATIQMQQPEGRVAPKGPAAPELGPWPAGGELYDGRTLFHGRDFQVIRRLDGVS
RDGIAGTVVGLREAGWVAQPWKTDPAALDGGLQLATLWTQHVLGGAALPMSVGALHTFAEGPSDG
PLRAVVRGQIVARDRTKADIAFVDDRGSLVAELRDVQYVLRPDTARGQA"
/note="primer of Streptococcus pneumoniae
Expected output (from regexhero.net):
In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...)
:
/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))
# (?>(?>""|[^"]+)+)
(?> # throw away the states created by (...)+
(?> # throw away the states created by [^"]+
""|[^"]+
)+
)
Your issue when using (?:""|[^"]+)+
on a string that will never match, is linked to the fact that each time you match a new [^"]
character the regex engine can choose to use the inner or outer +
quantifier.
This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.
We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.
See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)
Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.
You should use
/(\w+)=(\d+|"(?:""|[^"])*"(?!"))
See it live on regex101.com.
Explanation:
/ # Slash
(\w+) # Indentifier --> Group 1
= # Equals sign
( # Group 2:
\d+ # Either a number
| # or
"(?:""|[^"])*" # a quoted string
(?!") # unless another quote follows
) # End of group 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With