I need to search for lines in a CSV file that end in an unterminated, double-quoted string.
For example:
1,2,a,b,"dog","rabbit
would match whereas
1,2,a,b,"dog","rabbit","cat bird"
1,2,a,b,"dog",rabbit
would not.
I have very limited experience with regular expressions, and the only thing I could think of is something like
"[^"]*$
However, that matches from the last quote to the end of the line, whether or not that quote closes a string.
How would this be done?
Assuming quotes can't be escaped, you need to test the parity of quotes (making sure that there's an even number of them instead of odd). Regular expressions are great for that:
^(([^"]*"){2})*[^"]*$
That will match all lines with an even number of quotes. You can invert the result for all strings with an odd number. Or you can just add another ([^"]*") part at the beginning:
^[^"]*"(([^"]*"){2})*[^"]*$
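As a quick check, here is a minimal Python sketch that runs the odd-parity pattern against the three sample lines from the question. The name has_dangling_quote is only illustrative, and the sketch assumes quotes are never escaped.

```python
import re

# Odd number of quotes: one quote, then any number of quote pairs.
ODD_QUOTES = re.compile(r'^[^"]*"(([^"]*"){2})*[^"]*$')


def has_dangling_quote(line: str) -> bool:
    """Return True if the line contains an odd number of double quotes."""
    # Strip the line ending so a trailing newline is not part of the test.
    return ODD_QUOTES.match(line.rstrip("\r\n")) is not None


samples = [
    '1,2,a,b,"dog","rabbit',
    '1,2,a,b,"dog","rabbit","cat bird"',
    '1,2,a,b,"dog",rabbit',
]
for line in samples:
    # Prints True for the first sample and False for the other two.
    print(has_dangling_quote(line), line)
```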
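Similarly, if your regex engine supports reluctant (lazy) quantifiers in addition to greedy ones, you can write a simpler-looking expression:
^((.*?"){2})*.*$ #even
^.*?"((.*?"){2})*.*$ #odd
Be careful with these, though: . also matches a quote, so a backtracking engine can still make the even pattern match any line and the odd pattern match any line that contains at least one quote. The [^"] versions above are the ones that actually enforce parity.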
Now, if quotes can be escaped, it's a different question entirely, but the approach would be similar: determine the parity of unescaped quotes.
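One way to sketch the escaped case in Python, assuming (and this is only an assumption, since the question does not say) that quotes are escaped with a backslash: drop every backslash-escaped character first, then check the parity of the quotes that remain. The name has_dangling_unescaped_quote is made up for illustration. If quotes are instead escaped by doubling them (""), each escape adds two quotes and the parity is unchanged, so the plain patterns above still apply.

```python
import re

# Assumption: a backslash escapes the character that follows it (e.g. \" or \\).
ESCAPED_CHAR = re.compile(r'\\.', re.DOTALL)


def has_dangling_unescaped_quote(line: str) -> bool:
    """Return True if the line has an odd number of unescaped double quotes."""
    # Remove escaped pairs so only unescaped quotes are left to count.
    without_escapes = ESCAPED_CHAR.sub("", line)
    return without_escapes.count('"') % 2 == 1


print(has_dangling_unescaped_quote(r'1,2,"say \"hi\" to the dog'))   # True
print(has_dangling_unescaped_quote(r'1,2,"say \"hi\" to the dog"'))  # False
```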
Assuming that the strings cannot contain ", you need to match a string that has an odd number of quotes, like this:
([^"]*("[^"]*")?)*"
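This matches zero or more groups, each an unquoted run optionally followed by a complete quoted string, and then one more quote.
Note that the nested quantifiers make it prone to catastrophic backtracking, so it is vulnerable to a regular-expression denial-of-service (ReDoS) attack. It is also unanchored as written; to test a whole line, put ^ in front and [^"]*$ after it.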
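To make the behaviour concrete, here is a small Python sketch (the variable names are illustrative). It suggests that the pattern, searched for as written, finds a match in any line that contains a quote, whereas a variant that must cover the whole line, with [^"]* added for the unquoted tail, separates the question's three samples. The inputs are kept short on purpose, because of the backtracking risk.

```python
import re

# The pattern exactly as given in the answer.
AS_WRITTEN = re.compile(r'([^"]*("[^"]*")?)*"')
# Assumed variant: used with fullmatch, with [^"]* covering the unquoted tail.
ANCHORED = re.compile(r'([^"]*("[^"]*")?)*"[^"]*')

samples = [
    '1,2,a,b,"dog","rabbit',
    '1,2,a,b,"dog","rabbit","cat bird"',
    '1,2,a,b,"dog",rabbit',
]

unanchored_hits = [AS_WRITTEN.search(s) is not None for s in samples]
anchored_hits = [ANCHORED.fullmatch(s) is not None for s in samples]

print(unanchored_hits)  # [True, True, True]: every sample contains a quote
print(anchored_hits)    # [True, False, False]: only the first has an odd count
```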
Try this one:
".+[^"](,|$)
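This matches a quote (anywhere in the line), followed (greedily) by one or more characters, then a character that is not a quote, immediately before a comma or the end of the line.
The net effect is intended to be that it only matches lines with dangling quoted strings. However, because .+ can also consume quotes, it matches the question's third example (1,2,a,b,"dog",rabbit) as well.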
I think it's even immune to 'nested expandos attacks' (we do live in a very dangerous world ...)
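For comparison, a minimal Python check of this pattern against the three sample lines from the question (the names are illustrative). Because .+ is free to consume quotes, the output suggests it also flags the third sample, which has no dangling quote.

```python
import re

# The pattern exactly as given in the answer.
PATTERN = re.compile(r'".+[^"](,|$)')

samples = [
    '1,2,a,b,"dog","rabbit',
    '1,2,a,b,"dog","rabbit","cat bird"',
    '1,2,a,b,"dog",rabbit',
]

results = [PATTERN.search(s) is not None for s in samples]
print(results)  # [True, False, True]
```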