I have the following string:
'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'
Now, I wish to extract the following quotes:
1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.
I tried the following code but I'm not getting what I want. The [^\1]*
is not working as expected. Or is the problem elsewhere?
import re
s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"
for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
print("\nGroup {:d}: ".format(i+1))
for g in m.groups():
print(' '+g)
Try putting a backslash ( \ ) followed by " .
A nested quotation is a quotation that is encapsulated inside another quotation, forming a hierarchy with multiple levels. When focusing on a certain quotation, one must interpret it within its scope.
Use single quotes for a nested quotation, when someone repeats what someone else said. Joe smiled and said, "Jenny said 'yes' when I asked her to marry me." If you need another layer of quotation, just keep alternating between single and double quotation marks. "Joe was just here," said Susan.
If you really need to return all the results from a single regular expression applied only once, it will be necessary to use lookahead ((?=findme)
) so the finding position goes back to the start after each match - see this answer for a more detailed explanation.
To prevent false matches, some clauses are also needed regarding the quotes that add complexity, e.g. the apostrophe in I've
shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this but the rules I've gone for are:
A"
would not count as an opening quote but ,"
would count.'B
would not count as a closing quote but '.
would count.Applying the above rules leads to the following regular expression:
(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))
Debuggex Demo
A good quick sanity check test on any possible candidate regular expression is to reverse the quotes. This has been done in this regex101 demo.
EDIT
I modified my regex, it match properly even more complicated cases:
(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))
DEMO
It is now even more complicated, the main improvement is not matching directly after some of punctuation character ([!?.]
) and better quote case separation. Verified on diversified examples.
The sentence will be in content
captured group. Of course it has some restrictions, releted to usage of whitespaces, etc. But it should work with most of proper formatted sentences - or at least it work with examples.
(?=(?<!\w|[!?.])('|\")(?!\s)
- match the '
or "
not preceded by word or punctuation character ((?<!\w|[!?.])
) or not fallowed by whitespace((?!\s)
), the '
or "
part is captured in group 1 to further use,(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))
- match sentence, followed by
same char ('
or "
captured in group 1) as it was started, ignore other quotesIt doesn't match whole sentence directly, but with capturing group nested in lookaround construct, so with global match modifier it will match also sentences inside sentences - because it directly match only the place before sentence starts.
About your regex:
I suppose, that by [^\1]*
you meant any char but not one captured in group 1, but character class doesn't work this way, because it treats \1
as an char in octal notation (which I think is some kind of whitespace) not a reference to capturing group. Take a look on this example - read explanation. Also compare matching of THIS and THIS regex.
To achieve what you want, you should use lookaround, something like this: (')((?:.(?!\1))*.)
- capture the opening char, then match every char which is not followed by captured opening char, then capture one more char, which is directly before captured char - and you have whole content between chars you excluded.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With