Regular Expression To Split On Comma Except If Quoted

Tags: c#, regex, csv

What is the regular expression to split on comma (,) except if surrounded by double quotes? For example:

max,emily,john = ["max", "emily", "john"]

BUT

max,"emily,kate",john = ["max", "emily,kate", "john"]

Looking to use in C#: Regex.Split(string, "PATTERN-HERE");

Thanks.

Justin asked Nov 11 '10 01:11



2 Answers

Situations like this often call for something other than regular expressions. They are nifty, but patterns for handling this kind of thing are more complicated than they are useful.

You might try something like this instead:

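// Requires: using System; using System.Collections.Generic; using System.Text;
public static IEnumerable<string> SplitCSV(string csvString)
{
    var sb = new StringBuilder();   // accumulates the current field
    bool quoted = false;            // true while inside a quoted section

    foreach (char c in csvString) {
        if (quoted) {
            // Inside quotes, a quote closes the section; everything else,
            // commas included, is field data.
            if (c == '"')
                quoted = false;
            else
                sb.Append(c);
        } else {
            if (c == '"') {
                quoted = true;
            } else if (c == ',') {
                // An unquoted comma ends the current field.
                yield return sb.ToString();
                sb.Length = 0;
            } else {
                sb.Append(c);
            }
        }
    }

    // ArgumentException takes the message first, then the parameter name.
    if (quoted)
        throw new ArgumentException("Unterminated quotation mark.", "csvString");

    yield return sb.ToString();
}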

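It needs a few tweaks to follow the CSV spec exactly (for example, a doubled quote "" inside a quoted field is dropped instead of being kept as one literal quote), but the basic logic is sound.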
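
One such tweak, sketched here under assumptions: treat a doubled quote ("") inside a quoted field as an escaped literal quote, which is the usual CSV convention. The class and method names (CsvLineSplitter, SplitCsvLine) are hypothetical and not part of the answer above, and the sketch handles a single record only; it does not read multi-line files.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLineSplitter // hypothetical name, for illustration only
{
    // Splits one CSV record on commas. Inside a quoted field, "" is read
    // as one literal quote character instead of closing the field.
    public static List<string> SplitCsvLine(string line)
    {
        var fields = new List<string>();
        var sb = new StringBuilder();
        bool quoted = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (quoted)
            {
                if (c != '"')
                {
                    sb.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    sb.Append('"'); // doubled quote: keep one literal quote
                    i++;            // and skip its partner
                }
                else
                {
                    quoted = false; // a lone quote closes the quoted section
                }
            }
            else if (c == '"')
            {
                quoted = true;
            }
            else if (c == ',')
            {
                fields.Add(sb.ToString());
                sb.Clear();
            }
            else
            {
                sb.Append(c);
            }
        }

        if (quoted)
            throw new ArgumentException("Unterminated quotation mark.", nameof(line));

        fields.Add(sb.ToString());
        return fields;
    }
}
```

With this version, the input a,"say ""hi""",b should come back as three fields, the second being say "hi" with its inner quotes intact.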

cdhowie answered Oct 20 '22 07:10


This is a clear-cut case for a CSV parser, so you should be using .NET's own CSV parsing capabilities or cdhowie's solution.
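
The answer does not say which built-in parser it means. One candidate is TextFieldParser from the Microsoft.VisualBasic.FileIO namespace, which is usable from C#. The sketch below assumes the project can reference the Microsoft.VisualBasic assembly; the wrapper class and method names are hypothetical.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class BuiltInCsvExample // hypothetical name, for illustration only
{
    // Parses a single comma-delimited line, honoring double-quoted fields.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            // ReadFields returns null once there is no more data to read.
            return parser.ReadFields() ?? new string[0];
        }
    }
}
```

Called as BuiltInCsvExample.ParseLine("max,\"emily,kate\",john"), this should return the three fields with the quotes around emily,kate removed.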

Purely for your information and not intended as a workable solution, here's what contortions you'd have to go through using regular expressions with Regex.Split():

You could use the regex (please don't!)

(?<=^(?:[^"]*"[^"]*")*[^"]*)  # assert that there is an even number of quotes before...
\s*,\s*                       # the comma to be split on...
(?=(?:[^"]*"[^"]*")*[^"]*$)   # as well as after the comma.

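if your quoted strings never contain escaped quotes, and you don't mind the quotes themselves remaining in the resulting fields. The pattern is written with whitespace and comments, so it needs RegexOptions.IgnorePatternWhitespace.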

This is horribly inefficient, a pain to read and debug, works only in .NET (it relies on variable-length lookbehind), and it fails on escaped quotes (unless the escape is a doubled quote, "", which keeps the quote count even). Of course the regex could be modified to handle that as well, but then it's going to be perfectly ghastly.
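
For completeness, here is a minimal sketch of how that pattern would be passed to Regex.Split() in C#. The wrapper class name is hypothetical. Inside a C# verbatim string each literal quote is written as "", and RegexOptions.IgnorePatternWhitespace is required because the pattern contains layout whitespace and # comments.

```csharp
using System.Text.RegularExpressions;

public static class RegexSplitDemo // hypothetical name, for illustration only
{
    // Same pattern as above; "" inside @"..." stands for one literal quote.
    private const string Pattern = @"
        (?<=^(?:[^""]*""[^""]*"")*[^""]*)  # even number of quotes before...
        \s*,\s*                            # the comma to be split on...
        (?=(?:[^""]*""[^""]*"")*[^""]*$)   # as well as after the comma.";

    public static string[] Split(string input)
    {
        return Regex.Split(input, Pattern, RegexOptions.IgnorePatternWhitespace);
    }
}
```

Note that the quotes stay in the output: splitting max,"emily,kate",john this way yields "emily,kate" with its surrounding quotes, so they would still have to be trimmed afterwards.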

Tim Pietzcker answered Oct 20 '22 05:10