I am trying to handle un-matched double quotes within a string in the CSV format.
To be precise,
"It "does "not "make "sense", Well, "Does "it"
should be corrected as
"It" "does" "not" "make" "sense", Well, "Does" "it"
So basically what I am trying to do is to
replace all the ' " '
- Not preceded by a beginning of line or a comma (and)
- Not followed by a comma or an end of line
with ' " " '
For that I use the below regex
(?<!^|,)"(?!,|$)
The problem is that while Ruby regex engines (http://www.rubular.com/) are able to parse the regex, Python regex engines (https://pythex.org/, http://www.pyregex.com/) throw the following error:
Invalid regular expression: look-behind requires fixed-width pattern
And with Python 2.7.3 it throws:
sre_constants.error: look-behind requires fixed-width pattern
Can anyone tell me what vexes Python here?
==================================================================================
Following Tim's response, I got the output below for a multi-line string:
>>> str = """ "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it"
... "It "does "not "make "sense", Well, "Does "it" """
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str)
' "It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" "\n"It" "does" "not" "make" "sense", Well, "Does" "it" " '
At the end of each line, next to 'it' two double-quotes were added.
So I made a very small change to the regex to handle a new-line.
re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
But this gives the output
>>> re.sub(r'\b\s*"(?!,|$)', '" "', str,flags=re.MULTILINE)
' "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it"\n... "It" "does" "not" "make" "sense", Well, "Does" "it" " '
The last 'it' alone has two double-quotes.
But I wonder why the '$' end-of-line anchor does not recognize that the last line has ended.
==================================================================================
The final answer is
re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', str,flags=re.MULTILINE)
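A minimal sketch of why the extra quote appears, assuming it comes from the space that sits between the last quote and the closing triple quote in the sample string: with that space in the way, a bare $ cannot match directly after the quote, so the lookahead does not block the replacement. The shortened two-line sample and the variable names are illustrative only (Python 3 syntax).

```python
import re

# Shortened two-line sample; note the trailing space after the last quote,
# mirroring the space before the closing triple quote in the session above.
sample = ' "It "does "it"\n"It "does "it" '

# Bare $ with MULTILINE: the last quote is followed by a space, not by the
# end of the line, so it is still treated as an unmatched quote.
bare_dollar = re.sub(r'\b\s*"(?!,|$)', '" "', sample, flags=re.MULTILINE)

# Allowing optional spaces/tabs before $ leaves that last quote alone.
padded_dollar = re.sub(r'\b\s*"(?!,|[ \t]*$)', '" "', sample, flags=re.MULTILINE)

print(repr(bare_dollar))
print(repr(padded_dollar))
```

Only the second call leaves the final quote untouched, which matches the behaviour seen with the full four-line string.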
Python re lookbehinds really need to be fixed-width, and when a lookbehind pattern contains alternatives of different lengths, there are several ways to handle the situation:
- Rewrite the pattern so that the lookbehind needs no alternation. For example, (?<=[^,])"(?!,|$) is an equivalent of your current pattern that requires a char other than a comma before the double quote, and a common pattern to match words enclosed with whitespace, (?<=\s|^)\w+(?=\s|$), can be written as (?<!\S)\w+(?!\S).
- Split a positive lookbehind into alternated lookbehinds inside a group: (?<=a|bc) should be rewritten as (?:(?<=a)|(?<=bc)).
- If the lookbehind alternates an anchor with a single char, reverse the sign of the lookbehind and use a negated character class. (?<=\s|^) matches a position preceded by whitespace or at the start of a string/line (if re.M is used), so in Python re, use (?<!\S). Likewise, (?<=^|;) becomes (?<![^;]). If you also want the start of a line to match, add \n to the negated character class, e.g. (?<![^;\n]) (see Python Regex: Match start of line, or semi-colon, or start of string, none capturing group). Note this is not necessary with (?<!\S), as \S does not match a line feed char.
- Concatenate negative lookbehinds: (?<!^|,)"(?!,|$) should look like (?<!^)(?<!,)"(?!,|$).
Or simply install the PyPI regex module using pip install regex (or pip3 install regex) and enjoy infinite-width lookbehind.
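The rewrites above can be checked with a short sketch using only the built-in re module. The helper name compiles and the test string 'ax bcx cx' are made up for illustration; the patterns are the ones quoted in this answer.

```python
import re

def compiles(pattern):
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Alternatives of different width inside one lookbehind are rejected by re.
print(compiles(r'(?<!^|,)"(?!,|$)'))
print(compiles(r'(?<=a|bc)x'))

# Negative lookbehinds: concatenate one lookbehind per alternative.
split_negative = re.compile(r'(?<!^)(?<!,)"(?!,|$)')
# Same intent via a positive lookbehind on "any char but a comma".
negated_class = re.compile(r'(?<=[^,])"(?!,|$)')

s = '"It "does "not "make "sense", Well, "Does "it"'
split_hits = [m.start() for m in split_negative.finditer(s)]
class_hits = [m.start() for m in negated_class.finditer(s)]
print(split_hits == class_hits)

# Positive lookbehinds: alternate whole lookbehinds inside a group.
split_positive = re.compile(r'(?:(?<=a)|(?<=bc))x')
print([m.start() for m in split_positive.finditer('ax bcx cx')])
```

On this single-line input the two rewritten quote patterns select the same quotes; with multi-line input and re.M they can differ, because [^,] also accepts a newline before the quote.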
Python lookbehind assertions need to be fixed width, but you can try this:
>>> s = '"It "does "not "make "sense", Well, "Does "it"'
>>> re.sub(r'\b\s*"(?!,|$)', '" "', s)
'"It" "does" "not" "make" "sense", Well, "Does" "it"'
Explanation:
\b # Start the match at the end of a "word"
\s* # Match optional whitespace
" # Match a quote
(?!,|$) # unless it's followed by a comma or end of string
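As a small sketch, the commented form above can be compiled directly with re.VERBOSE, which ignores the layout whitespace and the # comments. The name QUOTE_FIX is only illustrative.

```python
import re

# Same pattern as above, kept in its commented layout via re.VERBOSE.
QUOTE_FIX = re.compile(r'''
    \b        # start the match at the end of a "word"
    \s*       # match optional whitespace
    "         # match a quote
    (?!,|$)   # unless it is followed by a comma or the end of the string
''', re.VERBOSE)

s = '"It "does "not "make "sense", Well, "Does "it"'
fixed = QUOTE_FIX.sub('" "', s)
print(fixed)
```

This should print the same corrected line as the one-line re.sub call shown earlier.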