
Dealing with fields containing unescaped double quotes with TextFieldParser

I am trying to import a CSV file using TextFieldParser. A particular CSV file is causing me problems due to its nonstandard formatting. The CSV in question has its fields enclosed in double quotes. The problem appears when there is an additional set of unescaped double quotes within a particular field.

Here is an oversimplified test case that highlights the problem. The actual CSV files I am dealing with are not all formatted the same and have dozens of fields, any of which may contain these possibly tricky formatting issues.

TextReader reader = new StringReader("\"Row\",\"Test String\"\n" +
    "\"1\",\"This is a test string.  It is parsed correctly.\"\n" +
    "\"2\",\"This is a test string with a comma,  which is parsed correctly\"\n" +
    "\"3\",\"This is a test string with double \"\"double quotes\"\". It is parsed correctly\"\n" +
    "\"4\",\"This is a test string with 'single quotes'. It is parsed correctly\"\n" +
    "5,This is a test string with fields that aren't enclosed in double quotes.  It is parsed correctly.\n" +
    "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"");

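// TextFieldParser lives in the Microsoft.VisualBasic.FileIO namespace
// (Microsoft.VisualBasic assembly).
using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };
    while (!parser.EndOfData)
    {
        // Rows 1-5 parse; row 6 is the one that fails.
        string[] fields = parser.ReadFields();
        Console.WriteLine("This line was parsed as:\n{0},{1}",
            fields[0], fields[1]);
    }
}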

Is there any way to properly parse a CSV with this type of formatting using TextFieldParser?

sglantz asked Apr 25 '13 22:04


2 Answers

I agree with Hans Passant's advice that it is not your responsibility to parse malformed data. However, in keeping with the Robustness Principle, someone faced with this situation may still want to handle specific kinds of malformed data. The code below works on the data set given in the question. It catches the parser error on the malformed line, checks the first character to decide whether the line is wrapped in double quotes, and if so splits the line and strips the wrapping double quotes manually.

using (TextFieldParser parser = new TextFieldParser(reader))
{
    parser.Delimiters = new[] { "," };

    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
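        catch (MalformedLineException)
        {
            // Fallback for lines whose fields are all wrapped in double quotes.
            if (parser.ErrorLine.StartsWith("\""))
            {
                // Drop the first and last character (assumed to be the outer
                // quotes), then split on the "," sequence between fields.
                var line = parser.ErrorLine.Substring(1, parser.ErrorLine.Length - 2);
                fields = line.Split(new string[] { "\",\"" }, StringSplitOptions.None);
            }
            else
            {
                // Not a case this fallback understands; rethrow.
                throw;
            }
        }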
        Console.WriteLine("This line was parsed as:\n{0},{1}", fields[0], fields[1]);
    }
}

I'm sure it is possible to concoct a pathological example where this fails (e.g. commas adjacent to double-quotes within a field value) but any such examples would probably be unparseable in the strictest sense, whereas the problem line given in the question is decipherable despite being malformed.
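To make that caveat concrete, here is a minimal sketch that pulls the split/strip step from the catch block above into a standalone function and runs it on one such pathological line. The class and method names (`QuoteFallback`, `SplitWrapped`) and the row 7 sample are hypothetical, made up for illustration; only plain string operations are used, so it does not need TextFieldParser.

```csharp
using System;

public static class QuoteFallback
{
    // Hypothetical helper: the same logic as the catch block, as a function.
    // Assumes errorLine starts and ends with a double quote.
    public static string[] SplitWrapped(string errorLine)
    {
        // Strip the outer quotes, then split on the "," between fields.
        var line = errorLine.Substring(1, errorLine.Length - 2);
        return line.Split(new[] { "\",\"" }, StringSplitOptions.None);
    }

    public static string[] Row6()
    {
        // Row 6 from the question: stray quotes, but no "," inside the field.
        return SplitWrapped(
            "\"6\",\"This is a test string with single \"double quotes\".  It can't be parsed.\"");
    }

    public static string[] Row7()
    {
        // Made-up row: the second field itself contains the sequence ",".
        return SplitWrapped("\"7\",\"He said \"yes\",\"no\" twice\"");
    }
}
```

Row 6 comes back as the expected two fields. Row 7 comes back as three, because the text inside the second field is indistinguishable from a field boundary, which is the sense in which such a line is genuinely ambiguous rather than merely malformed.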

Jordan Rieger answered Nov 13 '22 23:11


Jordan's solution is quite good, but it incorrectly assumes that the error line will always begin with a double quote. My error line was this:

170,"CMS ALT",853,,,NON_MOVEX,COM,NULL,"2014-04-25",""  204 Route de Trays"

Notice the last field had extra/unescaped double quotes, but the first field was fine. So Jordan's solution didn't work. Here is my modified solution based on Jordan's:

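using (TextFieldParser parser = new TextFieldParser(new StringReader(csv)))
{
    parser.Delimiters = new[] { "," };

    while (!parser.EndOfData)
    {
        string[] fields = null;
        try
        {
            fields = parser.ReadFields();
        }
        catch (MalformedLineException)
        {
            // SafeTrim is my own helper (not shown here).
            string errorLine = SafeTrim(parser.ErrorLine);
            // Plain split: the wrapping quotes stay in the values, and a
            // comma inside a quoted field would also split that field.
            fields = errorLine.Split(',');
        }
    }
}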

You may want to handle the catch block differently, but the general concept works great for me.
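One way the catch block could be handled differently is sketched below. This is not part of TextFieldParser and not something either answer proposed: it is a hand-rolled, lenient splitter (the names `LenientCsv` and `SplitLine` are hypothetical) that accepts a mix of quoted and unquoted fields. Its working assumption is that a double quote inside a quoted field only closes the field when it is followed by a comma or the end of the line; `""` is read as an escaped quote, and any other quote is kept as literal text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LenientCsv
{
    // Hypothetical fallback for a single malformed line.
    public static string[] SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;
        bool atFieldStart = true;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            bool hasNext = i + 1 < line.Length;

            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (hasNext && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote: "" becomes "
                    i++;
                }
                else if (!hasNext || line[i + 1] == ',')
                {
                    inQuotes = false;    // treated as the closing quote
                }
                else
                {
                    current.Append('"'); // stray quote: keep it as text
                }
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
                atFieldStart = true;
            }
            else if (c == '"' && atFieldStart)
            {
                inQuotes = true;         // opening quote of a wrapped field
                atFieldStart = false;
            }
            else
            {
                current.Append(c);
                atFieldStart = false;
            }
        }

        fields.Add(current.ToString());
        return fields.ToArray();
    }
}
```

In the catch block this would be used as `fields = LenientCsv.SplitLine(parser.ErrorLine);`. On the error line quoted above it yields ten fields, with the wrapping quotes removed and the stray quote kept inside the last value. It is still a heuristic: a field whose text contains a quote immediately followed by a comma will be split in the wrong place.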

HerrimanCoder answered Nov 13 '22 21:11