
Using regex, extract quoted strings that may contain nested quotes

Tags: python, regex

I have the following string:

'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'

Now, I wish to extract the following quotes:

1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.

I tried the following code but I'm not getting what I want. The [^\1]* is not working as expected. Or is the problem elsewhere?

import re

s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"

for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
    print("\nGroup {:d}: ".format(i+1))
    for g in m.groups():
        print('  '+g)
asked Sep 22 '16 11:09 by coder.in.me





2 Answers

If you really need to return all the results from a single regular expression applied only once, you will need to wrap the pattern in a lookahead ((?=findme)). A lookahead consumes no characters, so after each match the search resumes just past the position where that match started rather than past its end, which lets a quote nested inside another quote still be found. See this answer for a more detailed explanation.

To prevent false matches, the pattern also needs some extra conditions on the quote characters, which add complexity; for example, the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this, but the rules I've gone for are:

  1. An opening quote must not be immediately preceded by a word character (e.g. letter). So for example, A" would not count as an opening quote but ," would count.
  2. A closing quote must not be immediately followed by a word character (e.g. letter). So for example, 'B would not count as a closing quote but '. would count.

Applying the above rules leads to the following regular expression:

(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))
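A minimal sketch of how this pattern might be used from Python, assuming the string s from the question. Because the whole pattern is a lookahead, every match is empty and the quoted text ends up in group 1 (single quotes) or group 2 (double quotes). The names pattern and quotes are illustrative only.

```python
import re

# The string from the question, split across lines for readability.
s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")

# The regular expression from this answer, unchanged.
pattern = re.compile(r'''(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))''')

# Exactly one of the two groups is set for each (zero-width) match.
quotes = [m.group(1) or m.group(2) for m in pattern.finditer(s)]

for i, q in enumerate(quotes, 1):
    print("{:d}. {}".format(i, q))
```

On the question's string this should list the three quotes in the order the question asks for them.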


A good quick sanity check for any candidate regular expression is to reverse the quotes. This has been done in this regex101 demo.
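The same check can be sketched in Python. This assumes that "reversing the quotes" means swapping only the quotation marks while leaving the apostrophes in I've and I'll untouched; the name reversed_s is made up for this example.

```python
import re

# The regular expression from this answer, unchanged.
pattern = re.compile(r'''(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))''')

# The question's sentence with single and double quotation marks swapped.
# Apostrophes inside words are deliberately left as they were.
reversed_s = ('"Well, I\'ve tried to say \'How Doth the Little Busy Bee,\' but it '
              'all came different!" Alice replied in a very melancholy voice. '
              'She continued, "I\'ll try again."')

reversed_quotes = [m.group(1) or m.group(2) for m in pattern.finditer(reversed_s)]

for q in reversed_quotes:
    print(q)
```

If the pattern treats both quote styles consistently, the same three quotes come back, with the inner and outer marks exchanged in the first one.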

answered Oct 12 '22 11:10 by Steve Chambers



EDIT

I modified my regex; it now properly matches even more complicated cases:

(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))

DEMO

It is now even more complicated. The main improvements are that it no longer matches directly after certain punctuation characters ([!?.]) and that it separates the two quote types better. Verified on a variety of examples.

The sentence will be in the content capture group. Of course it has some restrictions, related to the use of whitespace, etc. But it should work with most properly formatted sentences, or at least it works with the examples.

  • (?=(?<!\w|[!?.])('|\")(?!\s) - match a ' or " that is not preceded by a word or punctuation character ((?<!\w|[!?.])) and not followed by whitespace ((?!\s)); the ' or " is captured in group 1 for later use,
  • (?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w)) - match the sentence, followed by the same character (' or ", as captured in group 1) that opened it; other quote characters are ignored.

It doesn't match the whole sentence directly; the capturing group is nested inside a lookaround construct, so the regex itself matches only the position just before the sentence starts. With the global match modifier it will therefore also match sentences inside sentences.
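As a rough sketch, this is how the pattern could be applied with Python's re module, where finditer plays the role of the global modifier. It assumes the question's string s and that re accepts the pattern exactly as written; the variable names are illustrative.

```python
import re

# The string from the question, split across lines for readability.
s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")

# The regular expression from the EDIT above, unchanged.
pattern = re.compile(
    r'''(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))'''
)

matches = list(pattern.finditer(s))

# Each match itself is empty; the text lives in the named group "content".
contents = [m.group("content") for m in matches]

for m in matches:
    print(m.start(), repr(m.group(0)), m.group("content"))
```

Each match is zero-width, so the interesting part is only ever read from the content group, and nested quotes are found because the scan continues from inside the outer quote.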

About your regex:

I suppose that by [^\1]* you meant any character except the one captured in group 1, but a character class doesn't work this way: inside [...], \1 is treated as a character in octal notation (the control character with code 1, not whitespace), not as a reference to the capturing group. Take a look at this example and read the explanation. Also compare the matching of THIS and THIS regex.
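A small sketch that illustrates this behaviour in Python's re module; the sample strings and variable names are invented for the demonstration.

```python
import re

# Inside a character class, \1 is an octal escape for chr(1),
# so [\1] matches only that control character.
octal_hits = re.findall(r'[\1]', 'a\x01b')
print(octal_hits)

# Consequently [^\1]* accepts everything except chr(1), quotes included,
# and runs straight past the closing quote to the end of the text.
greedy = re.match(r"(')[^\1]*", "'abc' def 'ghi'")
print(repr(greedy.group(0)))
```

The second match swallows the whole sample string, which is why the original attempt never stops at the closing quote.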

To achieve what you want, you should use a lookaround, something like this: (')((?:.(?!\1))*.) - capture the opening character, then match every character that is not followed by the captured opening character, then match one more character, the one directly before the closing character. That gives you the whole content between the delimiters.
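A minimal sketch of that lookaround idea in Python, using the pattern exactly as suggested above. The sample sentence and variable names are assumptions made for the example.

```python
import re

# The suggested pattern: group 1 is the opening quote, group 2 the content.
pattern = re.compile(r"(')((?:.(?!\1))*.)")

# Simple case: no apostrophes inside the quoted text.
simple = pattern.search("say 'hello world' now")
print(simple.group(2))

# The question's string: the apostrophe in I've looks like a closing quote
# to this pattern, so the content stops right before it.
s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")
first = pattern.search(s)
print(first.group(2))
```

On its own this pattern cannot tell an apostrophe from a closing quote, which is what the extra (?!\w) and (?<!\w) conditions in the full expressions are there to handle.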

answered Oct 12 '22 11:10 by m.cekiera
