Regular Expressions - how to replace a character within quotes

Tags: regex

Hello regular expression experts,

There has never been a string manipulation problem I couldn't resolve with regular expressions until now, at least in an elegant manner using just one step. Here is the sample data I'm working with:

0,"section1","(7) Delivery of 'certificate' outside the United States prohibited. Since both section 339 of the 1940 statute, 68/ and section 341 of the present law are explicit in their statement that the certificate shall be furnished the citizen, only if such individual is at the time within the United States, it is clear that the document could not and cannot be delivered outside the United States.",http://www.google.com/

1,"section2",,http://www.google.com/

2,"section3",",,",http://www.google.com/

This is a section of a much larger CSV file. With one elegant regular expression, I'd like to replace only all the commas that occur within the double quotes with an underscore character (_). It is important that the regular expression does NOT replace any commas outside the quotes because that would mess up the CSV data structure.

Thanks, Tom

CLARIFICATION:

Sorry guys, I posted the question without fully clarifying my situation, so let me summarize below:

Assume that quotes within quotes are already escaped (quotes within quotes in a CSV file saved by Excel are represented by "" or """ etc., so they are easily replaced beforehand).
I am working within JavaScript.

Using the sample text above, here is what it SHOULD look like after running the regular expression replacement (there should be a total of 5 replacements):

0,"section1","(7) Delivery of 'certificate' outside the United States prohibited. Since both section 339 of the 1940 statute_ 68/ and section 341 of the present law are explicit in their statement that the certificate shall be furnished the citizen_ only if such individual is at the time within the United States_ it is clear that the document could not and cannot be delivered outside the United States.",http://www.google.com/

1,"section2",,http://www.google.com/

2,"section3","__",http://www.google.com/

asked Dec 18 '10 05:12

thdoan

2 Answers

I'll help you, but you have to promise to stop using the word "elegant". It's been working too hard lately, and deserves a rest. :P

(?m),(?=[^"]*"(?:[^"\r\n]*"[^"]*")*[^"\r\n]*$)

This matches a comma if, between the comma and the end of the record, there's an odd number of quotation marks. I'm assuming a standard CSV format, in which a record ends at the next line separator that isn't enclosed in quotes. Line separators are legal inside quoted fields, as are quotes if they're escaped with another quote.

Depending on which regex flavor you're using, you may have to use \r?$ instead of just $. In .NET, for example, only the linefeed (\n) is considered a line separator. In Java, by contrast, $ matches before the \r in \r\n, but not between the \r and the \n (unless you set UNIX_LINES mode).
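
Since the question targets JavaScript, here is a minimal sketch of how the pattern above might be applied there. It assumes the inline (?m) is replaced by the m flag (JavaScript does not accept a leading (?m)), with g added so every match is replaced, and it assumes no quoted field contains a line break. The function name replaceQuotedCommas and the shortened sample rows are illustrative, not part of the original answer.

```javascript
// Same lookahead as above, with the inline (?m) expressed as the `m` flag.
// `g` makes String.prototype.replace substitute every matching comma.
const commaInsideQuotes = /,(?=[^"]*"(?:[^"\r\n]*"[^"]*")*[^"\r\n]*$)/gm;

// Hypothetical helper: returns the CSV text with quoted commas replaced by "_".
// Assumption: each record sits on a single line.
function replaceQuotedCommas(csvText) {
  return csvText.replace(commaInsideQuotes, "_");
}

const sample = [
  '1,"section2",,http://www.google.com/',
  '2,"section3",",,",http://www.google.com/',
].join("\n");

console.log(replaceQuotedCommas(sample));
```

Note that the lookahead rescans the rest of the text for every comma, so on a very large file this may be noticeably slower than a single pass.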

answered Oct 14 '22 18:10

Alan Moore

Regular expressions are not particularly good at matching balanced text (i.e. starting and ending quotes).

A naïve approach would be to repeatedly apply something like this (until it no longer matched):

s/(^[^"]*(?:"[^"]*"[^"]*)*?)"([^",]*),([^"]*)"/$1"$2_$3"/
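
In JavaScript, that repeated substitution might be sketched as follows. The function name replaceOneAtATime is hypothetical; the pattern is the one shown above, and the loop simply stops once a pass leaves the text unchanged.

```javascript
// The substitution above as a JavaScript regex.
// No `g` flag: each pass rewrites a single comma inside a quoted field.
const quotedComma = /(^[^"]*(?:"[^"]*"[^"]*)*?)"([^",]*),([^"]*)"/;

// Hypothetical helper: reapplies the substitution until nothing changes.
function replaceOneAtATime(csvText) {
  let current = csvText;
  let previous;
  do {
    previous = current;
    current = current.replace(quotedComma, '$1"$2_$3"');
  } while (current !== previous);
  return current;
}

console.log(replaceOneAtATime('2,"section3",",,",http://www.google.com/'));
```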

But that wouldn't work with escaped quotes. The best (i.e. simplest, most readable, and most maintainable) solution is to use a CSV file parser, go through all the field values one by one (replacing commas with underscores as you go), then write it back out to the file.
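
To illustrate visiting field contents instead of pattern-matching, here is a rough JavaScript sketch that makes a single quote-aware pass over the text. It is a hand-rolled stand-in, not a real CSV parser: it only tracks whether the current character sits inside double quotes, and the function name is hypothetical. Because a doubled quote ("") toggles the state twice, Excel-style escaped quotes should not throw it off.

```javascript
// Hypothetical helper: replaces commas only while inside a double-quoted field.
// Assumption: quotes inside fields are escaped by doubling them ("").
function replaceCommasInQuotedFields(csvText, replacement = "_") {
  let insideQuotes = false;
  let result = "";
  for (const ch of csvText) {
    if (ch === '"') {
      // Every quote flips the state; a doubled quote flips it out and back in.
      insideQuotes = !insideQuotes;
      result += ch;
    } else if (ch === "," && insideQuotes) {
      result += replacement;
    } else {
      result += ch;
    }
  }
  return result;
}

console.log(replaceCommasInQuotedFields('2,"section3",",,",http://www.google.com/'));
```

Line breaks inside quoted fields need no special handling in this sketch, since the state only changes at quote characters.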

answered Oct 14 '22 20:10

Cameron
