
Split a comma-separated string with both quoted and unquoted strings [duplicate]

Tags:

c#

regex

I have the following comma-separated string that I need to split. The problem is that some of the content is within quotes and contains commas that shouldn't be used in the split.

String:

111,222,"33,44,55",666,"77,88","99"

I want the output:

111  
222  
33,44,55  
666  
77,88  
99  

I have tried this:

(?:,?)((?<=")[^"]+(?=")|[^",]+)   

But it reads the comma between "77,88","99" as a hit and I get the following output:

111  
222  
33,44,55  
666  
77,88  
,  
99  
Peter Norlén asked Sep 23 '10 08:09


2 Answers

Depending on your needs, you may not be able to use a CSV parser and may in fact have to reinvent the wheel.

You can do so with a fairly simple regex:

(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)

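It works as follows:

(?:^|,) = A non-capturing group that matches either the beginning of the string or a comma.

(\"(?:[^\"]+|\"\")*\"|[^,]*) = A numbered capture group that selects between 2 alternatives:

  1. a quoted field (the capture includes the surrounding quotes, and "" inside it is treated as an escaped quote)
  2. an unquoted field, i.e. everything up to the next comma

Once the surrounding quotes are stripped from the captured fields, this gives you the output you are looking for.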

Example code in C#

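using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static Regex csvSplit = new Regex("(?:^|,)(\"(?:[^\"]+|\"\")*\"|[^,]*)", RegexOptions.Compiled);

public static string[] SplitCSV(string input)
{
  List<string> list = new List<string>();
  foreach (Match match in csvSplit.Matches(input))
  {
    // Group 1 holds the field itself, without the leading comma.
    string curr = match.Groups[1].Value;

    // Strip the surrounding quotes from a quoted field and unescape "" to ".
    if (curr.Length >= 2 && curr[0] == '"' && curr[curr.Length - 1] == '"')
    {
      curr = curr.Substring(1, curr.Length - 2).Replace("\"\"", "\"");
    }

    list.Add(curr);
  }

  return list.ToArray();
}

private void button1_Click(object sender, RoutedEventArgs e)
{
    // Join the fields first; writing the array directly would only print its type name.
    Console.WriteLine(string.Join(Environment.NewLine, SplitCSV("111,222,\"33,44,55\",666,\"77,88\",\"99\"")));
}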

Warning: As per @MrE's comment, if a rogue newline character appears in a badly formed CSV file and you end up with an unbalanced quote (an opening " with no closing one), the regex above runs into catastrophic backtracking (https://www.regular-expressions.info/catastrophic.html) and your system will likely crash (like our production system did). This is easy to replicate in Visual Studio, which, as I've discovered, it will also crash. A simple try/catch will not trap this issue either.

You should use this pattern instead (note that it no longer treats "" inside a quoted field as an escaped quote):

(?:^|,)(\"(?:[^\"])*\"|[^,]*)
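
As an extra safeguard against runaway matching, here is a minimal sketch that combines the simplified pattern with a regex match timeout. It assumes .NET Framework 4.5 or later, where the Regex constructor accepts a TimeSpan; the class name CsvRegexSketch, the method name SplitCsvWithTimeout and the one-second limit are hypothetical choices for illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvRegexSketch
{
    // Simplified pattern from the answer, plus a match timeout.
    // The one-second limit is an arbitrary value chosen for this sketch.
    private static readonly Regex CsvField = new Regex(
        "(?:^|,)(\"(?:[^\"])*\"|[^,]*)",
        RegexOptions.None,
        TimeSpan.FromSeconds(1));

    public static string[] SplitCsvWithTimeout(string line)
    {
        var fields = new List<string>();
        try
        {
            // Matches() is evaluated lazily, so a timeout surfaces inside the loop.
            foreach (Match match in CsvField.Matches(line))
            {
                string field = match.Groups[1].Value;

                // Drop the surrounding quotes of a quoted field.
                if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
                {
                    field = field.Substring(1, field.Length - 2);
                }

                fields.Add(field);
            }
        }
        catch (RegexMatchTimeoutException)
        {
            // Report the line as malformed instead of letting the match run on.
            throw new FormatException("CSV line took too long to parse; it may be malformed.");
        }

        return fields.ToArray();
    }
}
```

Calling CsvRegexSketch.SplitCsvWithTimeout on the string from the question should yield the six fields 111, 222, 33,44,55, 666, 77,88 and 99. A timeout only bounds how long a match may run, so it complements the simpler pattern and does not replace it.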

jimplode answered Sep 23 '22 05:09


Fast and easy:

    public static string[] SplitCsv(string line)
    {
        List<string> result = new List<string>();
        StringBuilder currentStr = new StringBuilder("");
        bool inQuotes = false;
        for (int i = 0; i < line.Length; i++) // For each character
        {
            if (line[i] == '\"') // Quotes are closing or opening
                inQuotes = !inQuotes;
            else if (line[i] == ',') // Comma
            {
                if (!inQuotes) // If not in quotes, end of current string, add it to result
                {
                    result.Add(currentStr.ToString());
                    currentStr.Clear();
                }
                else
                    currentStr.Append(line[i]); // If in quotes, just add it 
            }
            else // Add any other character to current string
                currentStr.Append(line[i]); 
        }
        result.Add(currentStr.ToString());
        return result.ToArray(); // Return array of all strings
    }

With this string as input:

 111,222,"33,44,55",666,"77,88","99"

It returns:

111  
222  
33,44,55  
666  
77,88  
99  
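
The loop above drops every quote character, so a doubled quote ("") inside a quoted field is lost. Here is a minimal sketch of one way to extend it, assuming doubled quotes are meant as an escaped literal quote (the convention the regex in the other answer uses); the names CsvLoopSketch and SplitCsvWithEscapes are hypothetical.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CsvLoopSketch
{
    public static string[] SplitCsvWithEscapes(string line)
    {
        var result = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    // Doubled quote inside a quoted field: keep one literal quote.
                    current.Append('"');
                    i++; // Skip the second quote of the pair.
                }
                else
                {
                    // A lone quote opens or closes a quoted field.
                    inQuotes = !inQuotes;
                }
            }
            else if (c == ',' && !inQuotes)
            {
                // An unquoted comma ends the current field.
                result.Add(current.ToString());
                current.Clear();
            }
            else
            {
                // Any other character, including a quoted comma, belongs to the field.
                current.Append(c);
            }
        }

        result.Add(current.ToString());
        return result.ToArray();
    }
}
```

For the input from the question this behaves like the original loop, and for an input such as "a ""b"" c",d it should produce the two fields a "b" c and d.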
Antoine answered Sep 24 '22 05:09