Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression to match strings in quotes with double-quotes inside

Tags:

java

regex

java-6

I face a challenge to match the input in the following format:

  • The input consists of key=value pairs. The key starts with slash. The value may be a number or a string in quotes.
  • The value may optionally contain escaped quotes, that is quote following by a quote (""). Such escaped quote should be considered a part of value. There is no need to check that escaped quotes are balanced (e.g. ends by another escaped quote).

The regular expression should match the given key=value part of the sequence and should not break for long inputs (e.g. value is 10000 characters).

First I came to this solution:

/(\w+)=(\d+|"(?:""|[^"])+"(?!"))

and it performs not bad, however it fails in Java6 with StackOverflowError for long inputs (cashes regexplanet for example). I tried to improve it a bit to run faster:

/(\w+)=(\d+|"(?:""|[^"]+)+"(?!"))

but then if input is not matching, it enters endless loop in backtracking trying to match it.

Then I came to this regex:

/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!"))

which is performing slower, but it seems to solve the task.

Can anyone suggest a better / faster regex?

Sample input:

/mol_type="protein" /transl_table=11 /note="[CDS] (""multi
line)"  nn  /organism="""Some"" Sequence" nn  /organism="Some ""Sequence"""
/translation="MHPSSSRIPHIAVVGVSAIFPGSLDAHGFWRDILSGTDLITDVPSTHWLVE
DYYDPDPSAPDKTYAKRGAFLKDVPFDPLEWGVPPSIVPATDTTQLLALIVAKRVLEDAAQGQFE
SMSRERMSVILGVTSAQELLASMVSRIQRPVWAKALRDLGYPEDEVKRACDKIAGNYVPWQESSF
PGLLGNVVAGRIANRLDLGGTNCVTDAACASSLSAMSMAINELALGQSDLVIAGGCDTMNDAFMY
MCFSKTPALSKSGDCRPFSDKADGTLLGEGIAMVALKRLDDAERDGDRVYAVIRGIGSSSDGRSK
SVYAPVPEGQAKALRRTYAAAGYGPETVELMEAHGTGTKAGDAAEFEGLRAMFDESGREDRQWCA
LGSVKSQIGHTKAAAGAAGLFKAIMALHHKVLPPTIKVDKPNPKLDIEKTAFYLNTQARPWIRPG
DHPRRASVSSFGFGGSNFHVALEEYTGPAPKAWRVRALPAELFLLSADTPAALADRARALAKEAE
VPEILRFLARESVLSFDASRPARLGLCATDEADLRKKLEQVAAHLEARPEQALSAPLVHCASGEA
PGRVAFLFPGQGSQYVGMGADALMTFDPARAAWDAAAGVAIADAPLHEVVFPRPVFSDEDRAAQE
ARLRETRWAQPAIGATSLAHLALLAALGVRAEAFAGHSFGEITALHAAGALSAADLLRVARRRGE
LRTLGQVVDHLRASLPAAGPAASASPAAAASVPKASTAAVPAVASVAAPGAAEVERVVMAVVAET
TGYPAEMLGLQMELESDLGIDSIKRVEILSAVRDRTPGLSEVDASALAQLRTLGQVVDHLRASLP
AASAGPAVAAPAAKAPAVAAPTGVSGATPGAAEVERVVMAVVAETTGYPAEMLGLQMELESDLGI
DSIKRVEILSAVRDRTPGLAEVDASALAQLRTLGQVVDHLRASLGPAAVTAGAAPAEPAEEPAST
PLGRWTLVEEPAPAAGLAMPGLFDAGTLVITGHDAIGPALVAALAARGIAAEYAPAVPRGARGAV
FLGGLRELATADAALAVHREAFLAAQAIAAKPALFVTVQDTGGDFGLAGSDRAWVGGLPGLVKTA
ALEWPEASCRAIDLERAGRSDGELAEAIASELLSGGVELEIGLRADGRRTTPRSVRQDAQPGPLP
LGPSDVVVASGGARGVTAATLIALARASHARFALLGRTALEDEPAACRGADGEAALKAALVKAAT
SAGQRVTPAEIGRSVAKILANREVRATLDAIRAAGGEALYVPVDVNDARAVAAALDGVRGALGPV
TAIVHGAGVLADKLVAEKTVEQFERVFSTKVDGLRALLGATAGDPLKAIVLFSSIAARGGNKGQC
DYAMANEVLNKVAAAEAARRPGCRVKSLGWGPWQGGMVNAALEAHFAQLGVPLIPLAAGAKMLLD
ELCDASGDRGARGQGGAPPGAVELVLGAEPKALAAQGHGGRVALAVRADRATHPYLGDHAINGVP
VVPVVIALEWFARAARACRPDLVVTELRDVRVLRGIKLAAYESGGEVFRVDCREVSNGHGAVLAA
ELRGPQGALHYAATIQMQQPEGRVAPKGPAAPELGPWPAGGELYDGRTLFHGRDFQVIRRLDGVS
RDGIAGTVVGLREAGWVAQPWKTDPAALDGGLQLATLWTQHVLGGAALPMSVGALHTFAEGPSDG
PLRAVVRGQIVARDRTKADIAFVDDRGSLVAELRDVQYVLRPDTARGQA"
/note="primer of  Streptococcus pneumoniae

Expected output (from regexhero.net):

RegEx

like image 756
dma_k Avatar asked Apr 14 '14 11:04

dma_k


2 Answers

In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...):

/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))

# (?>(?>""|[^"]+)+)
(?>               # throw away the states created by (...)+
    (?>           # throw away the states created by [^"]+
        ""|[^"]+
    )+
)

Your issue when using (?:""|[^"]+)+ on a string that will never match, is linked to the fact that each time you match a new [^"] character the regex engine can choose to use the inner or outer + quantifier.

This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.

We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.

See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)

like image 91
Robin Avatar answered Nov 14 '22 20:11

Robin


Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.

You should use

/(\w+)=(\d+|"(?:""|[^"])*"(?!"))

See it live on regex101.com.

Explanation:

/                # Slash
(\w+)            # Indentifier --> Group 1
=                # Equals sign
(                # Group 2:
 \d+             # Either a number
|                # or
 "(?:""|[^"])*"  # a quoted string
 (?!")           # unless another quote follows
)                # End of group 2
like image 36
Tim Pietzcker Avatar answered Nov 14 '22 20:11

Tim Pietzcker