Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split csv whose columns may contain ,

Tags:

c#

.net

csv

People also ask

How do I split a column in CSV files Python?

Recommended Answers # data string from typical csv file # header removed csv_data = '''\ 2011/07/01 00:00,318,6.6 2011/07/01 00:15,342,5.5 2011/07/01 00:30,329,6.6 2011/07/01 00:45,279,7.5''' mylist = csv_data. split('\n') #print(mylist) # test newlist = [] for line in mylist: line_list = line.


Use the Microsoft.VisualBasic.FileIO.TextFieldParser class. This will handle parsing a delimited file, TextReader or Stream where some fields are enclosed in quotes and some are not.

For example:

using Microsoft.VisualBasic.FileIO;

string csv = "2,1016,7/31/2008 14:22,Geoff Dalgas,6/5/2011 22:21,http://stackoverflow.com,\"Corvallis, OR\",7679,351,81,b437f461b3fd27387c5d8ab47a293d35,34";

TextFieldParser parser = new TextFieldParser(new StringReader(csv));

// You can also read from a file
// TextFieldParser parser = new TextFieldParser("mycsvfile.csv");

parser.HasFieldsEnclosedInQuotes = true;
parser.SetDelimiters(",");

string[] fields;

while (!parser.EndOfData)
{
    fields = parser.ReadFields();
    foreach (string field in fields)
    {
        Console.WriteLine(field);
    }
} 

parser.Close();

This should result in the following output:

2
1016
7/31/2008 14:22
Geoff Dalgas
6/5/2011 22:21
http://stackoverflow.com
Corvallis, OR
7679
351
81
b437f461b3fd27387c5d8ab47a293d35
34

See Microsoft.VisualBasic.FileIO.TextFieldParser for more information.

You need to add a reference to Microsoft.VisualBasic in the Add References .NET tab.


It is so much late but this can be helpful for someone. We can use RegEx as bellow.

Regex CSVParser = new Regex(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");
String[] Fields = CSVParser.Split(Test);

I see that if you paste csv delimited text in Excel and do a "Text to Columns", it asks you for a "text qualifier". It's defaulted to a double quote so that it treats text within double quotes as literal. I imagine that Excel implements this by going one character at a time, if it encounters a "text qualifier", it keeps going to the next "qualifier". You can probably implement this yourself with a for loop and a boolean to denote if you're inside literal text.

public string[] CsvParser(string csvText)
{
    List<string> tokens = new List<string>();

    int last = -1;
    int current = 0;
    bool inText = false;

    while(current < csvText.Length)
    {
        switch(csvText[current])
        {
            case '"':
                inText = !inText; break;
            case ',':
                if (!inText) 
                {
                    tokens.Add(csvText.Substring(last + 1, (current - last)).Trim(' ', ',')); 
                    last = current;
                }
                break;
            default:
                break;
        }
        current++;
    }

    if (last != csvText.Length - 1) 
    {
        tokens.Add(csvText.Substring(last+1).Trim());
    }

    return tokens.ToArray();
}

You could split on all commas that do have an even number of quotes following them.

You would also like to view at the specf for CSV format about handling comma's.

Useful Link : C# Regex Split - commas outside quotes


Use a library like LumenWorks to do your CSV reading. It'll handle fields with quotes in them and will likely overall be more robust than your custom solution by virtue of having been around for a long time.


It is a tricky matter to parse .csv files when the .csv file could be either comma separated strings, comma separated quoted strings, or a chaotic combination of the two. The solution I came up with allows for any of the three possibilities.

I created a method, ParseCsvRow() which returns an array from a csv string. I first deal with double quotes in the string by splitting the string on double quotes into an array called quotesArray. Quoted string .csv files are only valid if there is an even number of double quotes. Double quotes in a column value should be replaced with a pair of double quotes (This is Excel's approach). As long as the .csv file meets these requirements, you can expect the delimiter commas to appear only outside of pairs of double quotes. Commas inside of pairs of double quotes are part of the column value and should be ignored when splitting the .csv into an array.

My method will test for commas outside of double quote pairs by looking only at even indexes of the quotesArray. It also removes double quotes from the start and end of column values.

    public static string[] ParseCsvRow(string csvrow)
    {
        const string obscureCharacter = "ᖳ";
        if (csvrow.Contains(obscureCharacter)) throw new Exception("Error: csv row may not contain the " + obscureCharacter + " character");

        var unicodeSeparatedString = "";

        var quotesArray = csvrow.Split('"');  // Split string on double quote character
        if (quotesArray.Length > 1)
        {
            for (var i = 0; i < quotesArray.Length; i++)
            {
                // CSV must use double quotes to represent a quote inside a quoted cell
                // Quotes must be paired up
                // Test if a comma lays outside a pair of quotes.  If so, replace the comma with an obscure unicode character
                if (Math.Round(Math.Round((decimal) i/2)*2) == i)
                {
                    var s = quotesArray[i].Trim();
                    switch (s)
                    {
                        case ",":
                            quotesArray[i] = obscureCharacter;  // Change quoted comma seperated string to quoted "obscure character" seperated string
                            break;
                    }
                }
                // Build string and Replace quotes where quotes were expected.
                unicodeSeparatedString += (i > 0 ? "\"" : "") + quotesArray[i].Trim();
            }
        }
        else
        {
            // String does not have any pairs of double quotes.  It should be safe to just replace the commas with the obscure character
            unicodeSeparatedString = csvrow.Replace(",", obscureCharacter);
        }

        var csvRowArray = unicodeSeparatedString.Split(obscureCharacter[0]); 

        for (var i = 0; i < csvRowArray.Length; i++)
        {
            var s = csvRowArray[i].Trim();
            if (s.StartsWith("\"") && s.EndsWith("\""))
            {
                csvRowArray[i] = s.Length > 2 ? s.Substring(1, s.Length - 2) : "";  // Remove start and end quotes.
            }
        }

        return csvRowArray;
    }

One downside of my approach is the way I temporarily replace delimiter commas with an obscure unicode character. This character needs to be so obscure, it would never show up in your .csv file. You may want to put more handling around this.


I had a problem with a CSV that contains fields with a quote character in them, so using the TextFieldParser, I came up with the following:

private static string[] parseCSVLine(string csvLine)
{
  using (TextFieldParser TFP = new TextFieldParser(new MemoryStream(Encoding.UTF8.GetBytes(csvLine))))
  {
    TFP.HasFieldsEnclosedInQuotes = true;
    TFP.SetDelimiters(",");

    try 
    {           
      return TFP.ReadFields();
    }
    catch (MalformedLineException)
    {
      StringBuilder m_sbLine = new StringBuilder();

      for (int i = 0; i < TFP.ErrorLine.Length; i++)
      {
        if (i > 0 && TFP.ErrorLine[i]== '"' &&(TFP.ErrorLine[i + 1] != ',' && TFP.ErrorLine[i - 1] != ','))
          m_sbLine.Append("\"\"");
        else
          m_sbLine.Append(TFP.ErrorLine[i]);
      }

      return parseCSVLine(m_sbLine.ToString());
    }
  }
}

A StreamReader is still used to read the CSV line by line, as follows:

using(StreamReader SR = new StreamReader(FileName))
{
  while (SR.Peek() >-1)
    myStringArray = parseCSVLine(SR.ReadLine());
}