
Using regex, extract quoted strings that may contain nested quotes

Q: How do you include a quote in regex?

Try putting a backslash (\) followed by ".

Tags:

python

regex

I have the following string:

'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'

Now, I wish to extract the following quotes:

1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.

I tried the following code but I'm not getting what I want. The [^\1]* is not working as expected. Or is the problem elsewhere?

import re

s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"

for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
    print("\nGroup {:d}: ".format(i+1))
    for g in m.groups():
        print('  '+g)


asked Sep 22 '16 11:09

coder.in.me

2 Answers

If you really need to return all the results from a single regular expression applied only once, you will need a lookahead ((?=findme)). A lookahead is zero-width, so it consumes no text; the search resumes right after the position where the match started, which means quotes nested inside other quotes are found as well. See this answer for a more detailed explanation.
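The effect of the lookahead can be seen in a minimal sketch. The digit string is only an illustrative input, not something from the question: a plain pattern consumes what it matches and so skips overlapping candidates, while a capture group inside a lookahead reports all of them.

```python
import re

sample = "1234"  # illustrative input only

# A normal pattern consumes each match, so the scan resumes after it.
consuming = re.findall(r"\d\d", sample)

# Inside a lookahead the match itself is empty; only the capture group
# holds text, and the scan advances one character at a time.
overlapping = re.findall(r"(?=(\d\d))", sample)

print(consuming)    # ['12', '34']
print(overlapping)  # ['12', '23', '34']
```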

To prevent false matches, the pattern also needs some conditions on the quote characters, which adds complexity; e.g. the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this, but the rules I've gone for are:

1. An opening quote must not be immediately preceded by a word character (e.g. a letter). So for example, A" would not count as an opening quote but ," would count.
2. A closing quote must not be immediately followed by a word character (e.g. a letter). So for example, 'B would not count as a closing quote but '. would count.

Applying the above rules leads to the following regular expression:

(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))
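As a rough check, the expression can be run against the string from the question with Python's re module. This is a sketch rather than part of the original answer; the variable names are illustrative, and picking group 1 or group 2 reflects the two alternation branches (single and double quotes).

```python
import re

s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")

# The expression as given; a raw triple-quoted string avoids quote escaping.
pattern = re.compile(r"""(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))""")

# Every match is empty; the quoted text sits in group 1 (single quotes)
# or group 2 (double quotes), depending on which branch matched.
quotes = [m.group(1) or m.group(2) for m in pattern.finditer(s)]

for i, q in enumerate(quotes, 1):
    print("{:d}. {}".format(i, q))
```

On this input the loop prints the three quotes listed in the question, in that order.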


A good quick sanity check for any candidate regular expression is to reverse the quotes, i.e. swap every ' with " and vice versa. This has been done in this regex101 demo.
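The reversed-quote check can also be scripted. One observation about the expression as it appears here: the (?<!\w) lookbehind sits only in the single-quote branch, so rule 1 is not enforced for double quotes, and the reversed text produces extra hits starting at the " in I"ve and I"ll. The sketch below compares it with a hypothetical variant (named hoisted, not from the answer) that moves the lookbehind in front of both branches; the linked demo may well differ from what is transcribed here.

```python
import re

s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")

# Swap every ' with " and vice versa.
reversed_s = s.translate(str.maketrans("'\"", "\"'"))

# The expression exactly as given in the answer.
as_given = re.compile(r"""(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))""")
# Hypothetical variant: the lookbehind is hoisted so it guards both branches.
hoisted = re.compile(r"""(?=(?<!\w)(?:'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))""")


def extract(pattern, text):
    """Return the captured quote text for every (empty) lookahead match."""
    return [m.group(1) or m.group(2) for m in pattern.finditer(text)]


given_hits = extract(as_given, reversed_s)
hoisted_hits = extract(hoisted, reversed_s)

print(len(given_hits))  # 5: includes two hits that start mid-word
for hit in hoisted_hits:
    print(hit)
```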


answered Oct 12 '22 11:10

Steve Chambers

EDIT

I modified my regex; it now properly matches even more complicated cases:

(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))


It is now even more complicated. The main improvements are that it no longer matches directly after certain punctuation characters ([!?.]) and that it separates the quote cases better. Verified on a variety of examples.

The sentence will be in the content capturing group. Of course it has some restrictions, related to the use of whitespace, etc. But it should work with most properly formatted sentences, or at least it works with the examples.

(?=(?<!\w|[!?.])('|\")(?!\s) - matches a ' or " that is not preceded by a word or punctuation character ((?<!\w|[!?.])) and is not followed by whitespace ((?!\s)); the ' or " is captured in group 1 for later use,
(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w)) - matches the sentence, followed by the same character (the ' or " captured in group 1) that opened it, ignoring other quotes.

It doesn't match the whole sentence directly, but through a capturing group nested in a lookaround construct. With the global match modifier it therefore also matches sentences inside sentences, because the direct match is only the empty position before the sentence starts.
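A short sketch of how this expression might be used from Python, where re.finditer plays the role of the global match modifier. The variable names are illustrative, and the input is the string from the question.

```python
import re

s = ("'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all "
     "came different!' Alice replied in a very melancholy voice. "
     "She continued, 'I'll try again.'")

# The expression as given; a raw triple-quoted string avoids quote escaping.
pattern = re.compile(
    r"""(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))"""
)

# Each match is an empty position; the text is read from the named group.
sentences = [m.group("content") for m in pattern.finditer(s)]

for i, sentence in enumerate(sentences, 1):
    print("{:d}. {}".format(i, sentence))
```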

About your regex:

I suppose that by [^\1]* you meant "any character except the one captured in group 1", but a character class doesn't work this way: inside [...], \1 is treated as a character in octal notation (character code 1, a control character rather than whitespace), not as a reference to the capturing group. Take a look at this example and read the explanation. Also compare the matches of THIS and THIS regex.
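This behaviour is easy to confirm in Python. The sketch below uses made-up inputs: the class [^\1] accepts a quote character, rejects only the character with code 1, and therefore runs past the first closing quote.

```python
import re

# Inside [...], \1 is the octal escape for the character with code 1.
not_backref = re.compile(r"[^\1]")

accepts_quote = not_backref.fullmatch("'") is not None
rejects_x01 = not_backref.fullmatch("\x01") is None

# So the class does not stop at the quote captured in group 1; the greedy
# [^\1]* swallows everything up to the LAST matching quote.
m = re.match(r"""(['"])([^\1]*)\1""", "'one' and 'two'")
greedy_content = m.group(2)

print(accepts_quote, rejects_x01)  # True True
print(greedy_content)              # one' and 'two
```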

To achieve what you want, you should use a lookaround, something like this: (')((?:.(?!\1))*.) - capture the opening character, then match every character that is not followed by the captured opening character, then match one more character, the one directly before the closing quote. That gives you the whole content between the quote characters.
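A small sketch of this idea on a made-up input. The first pattern is the one from the paragraph above; the second (named either) is an assumed variant that accepts both quote types and also consumes the closing quote, so that re.finditer can carry on to the next quotation.

```python
import re

text = "'one' and 'two'"  # illustrative input only

# The pattern from the paragraph above, unchanged.
first_match = re.search(r"(')((?:.(?!\1))*.)", text)
first = first_match.group(2)

# Assumed variant: either quote type, and the closing quote is consumed,
# so the next search starts after the whole quotation.
either = re.compile(r"""(['"])((?:.(?!\1))*.)\1""")
found = [m.group(2) for m in either.finditer("say 'one' and \"two\"")]

print(first)  # one
print(found)  # ['one', 'two']
```

Neither pattern knows about apostrophes, so on text such as I've they would stop early; that is what the longer expressions are for.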

answered Oct 12 '22 11:10

m.cekiera
