Python Regex Matching Partial Parentheses ONLY

Tags: python, regex

I have some poorly formatted text that I need to filter. There are plenty of cases in which a quote begins on one line, is cut off, and finishes on the next line. In such a case, my preference is to remove the partial quotes completely, BUT I want to preserve regular full quotes. I know that this can be done iteratively with a counter, but I would really prefer to do it with regular expressions.

Take for example:

"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote

Here is my current attempt:

(\"[^\"\n]+?|^[^\"\n]+?\")(\n|$)

Note that it fails in two circumstances:

  1. Line 3 -- a partial quote precedes the rest of a sentence (a very rare occurrence, so if we can't solve it, it's not the end of the world).
  2. Line 6 -- an embedded quote. This is a major problem and the main reason I have taken to SO with my problem. It grabs everything from the last quote mark of the embedded quote to the end of the line.

I figured that I could set up an if statement and run each line through it, checking whether the line has fewer than two quotes and then parsing out the partial quotes, but I thought the minds at SO would have a much cleaner solution.

NOTE: The desired output is:

"This is a quote"
This is an end 
 Here is more text.
This is an end 
This is an "embedded" quote

(I handle the whitespace later on.)
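For comparison, the counter-style approach mentioned in the question could look roughly like the sketch below. It is only an illustration, not part of the original post: the function name strip_partial_quotes is hypothetical, and it assumes that a line with an odd number of quote marks opens a partial quote at its last quote mark.

```python
def strip_partial_quotes(text: str) -> str:
    """Drop quotes that start on one line and end on a later one."""
    kept = []
    inside = False  # True while a quote opened on an earlier line is still open
    for line in text.split("\n"):
        if inside:
            close = line.find('"')
            if close == -1:
                continue  # the whole line belongs to the partial quote
            inside = False
            line = line[close + 1:]  # keep only what follows the closing quote
            if not line:
                continue  # nothing left on this line
        if line.count('"') % 2 == 1:
            # Assumption: an odd count means the last quote mark opens a partial quote.
            line = line[:line.rindex('"')]
            inside = True
        kept.append(line)
    return "\n".join(kept)


sample = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''

print(strip_partial_quotes(sample))
```

On the sample text this keeps the full and embedded quotes and drops both partial quotes, leaving the trailing whitespace for a later clean-up step.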

andoni asked May 03 '26 15:05


2 Answers

Here you go,

^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)

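Replace the matched characters with \1\n. Multiline mode must be enabled (for example with (?m) in Python) so that ^ matches at the start of every line.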

DEMO

>>> import re
>>> s = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''
>>> m = re.sub(r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"[^"\n]*\n[^"\n]*"(\n|)', r'\1\n', s)
>>> print(m)
"This is a quote"
This is an end 
 Here is more text.
This is an end 
This is an "embedded" quote
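To make the pattern easier to follow, here is the same regex written with re.VERBOSE and one comment per piece. This is a reading aid rather than part of the original answer; the name PARTIAL_QUOTE and the comments are my own.

```python
import re

PARTIAL_QUOTE = re.compile(r'''
    ^                               # start of a line (needs re.MULTILINE)
    (                               # group 1: the part of the line to keep
      (?:[^"\n]*"[^"\n]*")*         #   any number of complete quote pairs
      [^"\n]*                       #   plain text before the dangling quote
    )
    "[^"\n]*\n                      # opening quote, rest of the line, line break
    [^"\n]*"                        # next line up to the closing quote
    (\n|)                           # also consume the newline if the quote ends the line
''', re.VERBOSE | re.MULTILINE)

sample = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''

cleaned = PARTIAL_QUOTE.sub(r'\1\n', sample)
print(cleaned)
```

Because group 1 only ever contains complete quote pairs, the embedded quote on the last line is never treated as the start of a partial quote.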

Use this regex instead if a quote may run across more than two lines.

^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)

DEMO
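The linked demo is not reproduced here. As a minimal sketch, the multi-line variant can be tried in Python as follows; the three-line quote in the sample text is my own example, not taken from the question.

```python
import re

# Same replacement as before (\1\n), but the quote may now contain several line breaks.
pattern = r'(?m)^((?:[^"\n]*"[^"\n]*")*[^"\n]*)"(?:[^"\n]*\n)+[^"\n]*"(\n|)'

sample = '''"This is a quote"
This is an end "partial-
quote that runs
over three lines" Here is more text.
This is an "embedded" quote'''

cleaned = re.sub(pattern, r'\1\n', sample)
print(cleaned)
```

The only change from the first pattern is that the "rest of the line plus line break" part is wrapped in (?:...)+ so it can repeat.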

Avinash Raj answered May 05 '26 06:05


("[^"\n]*")|"[^"]*(\n)[^"]*"(?![^\n]*")|"[^"]*\n.*?(?=\n[^"]*"[^\n"]*")

You can try this. It also takes care of an odd number of quotes. See the demo:

https://regex101.com/r/dL7oF8/6
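The answer does not state which replacement to use with this pattern. The sketch below assumes the captured groups are kept (a full quote from group 1, or the line break from group 2) and everything else in the match is dropped; that replacement is my assumption, not something confirmed by the answer.

```python
import re

pattern = re.compile(
    r'("[^"\n]*")|"[^"]*(\n)[^"]*"(?![^\n]*")|"[^"]*\n.*?(?=\n[^"]*"[^\n"]*")'
)

sample = '''"This is a quote"
This is an end "partial-
quote" Here is more text.
This is an end "partial-
quote w/o more text"
This is an "embedded" quote'''


def keep_groups(match):
    # Assumed replacement: keep group 1 (a complete quote) and group 2 (a line break).
    return (match.group(1) or "") + (match.group(2) or "")


cleaned = pattern.sub(keep_groups, sample)
print(cleaned)
```

With this assumed replacement, a removed quote that ends its line leaves an empty line behind, so that would need to be cleaned up along with the other whitespace.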

vks answered May 05 '26 06:05