
Regular Expression (C#) For CSV by RFC 4180

Tags:

c#

regex

csv

I need a universal CSV parser that follows the RFC 4180 specification. The CSV file below contains all of the problem cases from the specification.

Excel opens the file exactly as the specification describes.

Does anyone have a working regex that parses it?

CSV File

"a
b
c","x
y
z",357
test;test,xxx;xxx,152
"test2,test2","xxx2,xxx2",123
"test3""test3","xxx3""xxx3",987
,qwe,13
asd,123,
,,
,123,
,,123
123,,
123,123

Expected Results

Table by EXCEL

asked Dec 07 '15 11:12 by cherrex


1 Answer

NOTE: Though the solution below can likely be adapted for other regex engines, using it as-is will require that your regex engine treats multiple named capture groups using the same name as one single capture group. (.NET does this by default)
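To make that requirement concrete, here is a minimal sketch of the behavior in .NET. The class name, method name, and toy pattern are illustrative assumptions and are not part of the solution below. Two alternatives share the group name V, and .NET records the captures from both under a single group:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SameNameGroupDemo
{
    // Returns every capture recorded by the group named "V" for one match.
    public static string[] CapturesOf(string input)
    {
        // Both alternatives are named "V"; .NET merges them into one group
        // and keeps each capture in Group.Captures, in match order.
        Match m = Regex.Match(input, @"^(?:(?<V>\d+)|(?<V>[a-z]+)|,)+$");
        CaptureCollection caps = m.Groups["V"].Captures;
        string[] result = new string[caps.Count];
        for (int i = 0; i < caps.Count; i++)
            result[i] = caps[i].Value;
        return result;
    }

    public static void Main()
    {
        // Prints: 12|ab|7
        Console.WriteLine(string.Join("|", CapturesOf("12,ab,7")));
    }
}
```

An engine that rejects duplicate group names, or that keeps only the last capture of a repeated group, would need a differently structured pattern.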


###About the pattern

When one or more lines/records of a CSV file/stream (matching RFC standard 4180) are passed to the regular expression below it will return a match for each non-empty line/record. Each match will contain a capture group named Value that contains the captured values in that line/record (and potentially an OpenValue capture group if there was an open quote at the end of the line/record).

Here's the commented pattern (test it on Regexstorm.net):

(?<=\r|\n|^)(?!\r|\n|$)                       // Records start at the beginning of line (line must not be empty)
(?:                                           // Group for each value and a following comma or end of line (EOL) - required for quantifier (+?)
  (?:                                         // Group for matching one of the value formats before a comma or EOL
    "(?<Value>(?:[^"]|"")*)"|                 // Quoted value -or-
    (?<Value>(?!")[^,\r\n]+)|                 // Unquoted value -or-
    "(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|   // Open ended quoted value -or-
    (?<Value>)                                // Empty value before comma (before EOL is excluded by "+?" quantifier later)
  )
  (?:,|(?=\r|\n|$))                           // The value format matched must be followed by a comma or EOL
)+?                                           // Quantifier to match one or more values (non-greedy/as few as possible to prevent infinite empty values)
(?:(?<=,)(?<Value>))?                         // If the group of values above ended in a comma then add an empty value to the group of matched values
(?:\r\n|\r|\n|$)                              // Records end at EOL

Here's the raw pattern without all the comments or whitespace.
(?<=\r|\n|^)(?!\r|\n|$)(?:(?:"(?<Value>(?:[^"]|"")*)"|(?<Value>(?!")[^,\r\n]+)|"(?<OpenValue>(?:[^"]|"")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)
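The `//` comments in the commented listing are for reading only; a regex engine would treat them as literal pattern text. If you want to keep an annotated layout in source code, one option is RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The following is a sketch under that assumption; the class and method names are hypothetical, and every " is doubled because the pattern sits in a C# verbatim string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerboseCsvPattern
{
    public static readonly Regex Parser = new Regex(@"
        (?<=\r|\n|^)(?!\r|\n|$)                  # record starts on a non-empty line
        (?:
          (?:
              ""(?<Value>(?:[^""]|"""")*)""      # quoted value
            | (?<Value>(?!"")[^,\r\n]+)          # unquoted value
            | ""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)  # open-ended quoted value
            | (?<Value>)                         # empty value
          )
          (?:,|(?=\r|\n|$))                      # then a comma or EOL
        )+?
        (?:(?<=,)(?<Value>))?                    # trailing comma adds an empty value
        (?:\r\n|\r|\n|$)                         # record ends at EOL
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns the raw (still escaped) values of the first record in csv.
    public static string[] FirstRecord(string csv)
    {
        Match m = Parser.Match(csv);
        CaptureCollection caps = m.Groups["Value"].Captures;
        string[] values = new string[caps.Count];
        for (int i = 0; i < caps.Count; i++)
            values[i] = caps[i].Value;
        return values;
    }

    public static void Main()
    {
        // Three values are captured: a, then b,c, then an empty value.
        Console.WriteLine(string.Join(" | ", FirstRecord("a,\"b,c\",")));
    }
}
```

This works here because the pattern contains no literal spaces or # characters that would need escaping in this mode.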

Here is a visualization from Debuggex.com (capture groups named for clarity): [image: Debuggex.com visualization]

###Usage examples:

Simple example for reading an entire CSV file/stream at once (test it on C# Pad):
(For better performance and less impact on system resources you should use the second example)

using System;                           // Console, Environment, String, Int32
using System.Text.RegularExpressions;

Regex CSVParser = new Regex(
    @"(?<=\r|\n|^)(?!\r|\n|$)" +
    @"(?:" +
        @"(?:" +
            @"""(?<Value>(?:[^""]|"""")*)""|" +
            @"(?<Value>(?!"")[^,\r\n]+)|" +
            @"""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)|" +
            @"(?<Value>)" +
        @")" +
        @"(?:,|(?=\r|\n|$))" +
    @")+?" +
    @"(?:(?<=,)(?<Value>))?" +
    @"(?:\r\n|\r|\n|$)",
    RegexOptions.Compiled);

String CSVSample =
    ",record1 value2,val3,\"value 4\",\"testing \"\"embedded double quotes\"\"\"," +
    "\"testing quoted \"\",\"\" character\", value 7,,value 9," +
    "\"testing empty \"\"\"\" embedded quotes\"," +
    "\"testing a quoted value" + Environment.NewLine +
    Environment.NewLine +
    "that includes CR/LF patterns" + Environment.NewLine +
    Environment.NewLine +
    "(which we wish would never happen - but it does)\", after CR/LF" + Environment.NewLine +
    Environment.NewLine +
    "\"testing an open ended quoted value" + Environment.NewLine +
    Environment.NewLine +
    ",value 2 ,value 3," + Environment.NewLine +
    "\"test\"";

MatchCollection CSVRecords = CSVParser.Matches(CSVSample);

for (Int32 recordIndex = 0; recordIndex < CSVRecords.Count; recordIndex++)
{
    Match Record = CSVRecords[recordIndex];

    for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
    {
        Capture c = Record.Groups["Value"].Captures[valueIndex];
        Console.Write("R" + (recordIndex + 1) + ":V" + (valueIndex + 1) + " = ");

        if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '\"')
        {
            // No need to unescape/undouble quotes if the value is empty, the value starts
            // at the beginning of the record, or the character before the value is not a
            // quote (not a quoted value)
            Console.WriteLine(c.Value);
        }
        else
        {
            // The character preceding this value is a quote
            // so we need to unescape/undouble any embedded quotes
            Console.WriteLine(c.Value.Replace("\"\"", "\""));
        }
    }
    
    foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
        Console.WriteLine("ERROR - Open ended quoted value: " + OpenValue.Value);
}

Better example for reading a large CSV file/stream without reading the entire file/stream into a string (test it on C# Pad):

using System;                           // Console, Environment, String, Int32
using System.IO;
using System.Text;                      // StringBuilder
using System.Text.RegularExpressions;

// Same regex from before shortened to one line for brevity
Regex CSVParser = new Regex(
    @"(?<=\r|\n|^)(?!\r|\n|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,\r\n]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)",
    RegexOptions.Compiled);

String CSVSample =
    ",record1 value2,val3,\"value 4\",\"testing \"\"embedded double quotes\"\"\"," +
    "\"testing quoted \"\",\"\" character\", value 7,,value 9," +
    "\"testing empty \"\"\"\" embedded quotes\"," +
    "\"testing a quoted value," + Environment.NewLine +
    Environment.NewLine +
    "that includes CR/LF patterns" + Environment.NewLine +
    Environment.NewLine +
    "(which we wish would never happen - but it does)\", after CR/LF," + Environment.NewLine +
    Environment.NewLine +
    "\"testing an open ended quoted value" + Environment.NewLine +
    Environment.NewLine +
    ",value 2 ,value 3," + Environment.NewLine +
    "\"test\"";

using (StringReader CSVReader = new StringReader(CSVSample))
{
    String CSVLine = CSVReader.ReadLine();
    StringBuilder RecordText = new StringBuilder();
    Int32 RecordNum = 0;

    while (CSVLine != null)
    {
        RecordText.AppendLine(CSVLine);
        MatchCollection RecordsRead = CSVParser.Matches(RecordText.ToString());
        Match Record = null;
        
        for (Int32 recordIndex = 0; recordIndex < RecordsRead.Count; recordIndex++)
        {
            Record = RecordsRead[recordIndex];
        
            if (Record.Groups["OpenValue"].Success && recordIndex == RecordsRead.Count - 1)
            {
                // We're still trying to find the end of a multi-line value in this record
                // and it's the last of the records from this segment of the CSV.
                // If we're not still working with the initial record we started with then
                // prep the record text for the next read and break out to the read loop.
                if (recordIndex != 0)
                    RecordText.AppendLine(Record.Value);
                
                break;
            }
            
            // Valid record found or new record started before the end could be found
            RecordText.Clear();            
            RecordNum++;
            
            for (Int32 valueIndex = 0; valueIndex < Record.Groups["Value"].Captures.Count; valueIndex++)
            {
                Capture c = Record.Groups["Value"].Captures[valueIndex];
                Console.Write("R" + RecordNum + ":V" + (valueIndex + 1) + " = ");
                if (c.Length == 0 || c.Index == Record.Index || Record.Value[c.Index - Record.Index - 1] != '\"')
                    Console.WriteLine(c.Value);
                else
                    Console.WriteLine(c.Value.Replace("\"\"", "\""));
            }
            
            foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
                Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
        }
        
        CSVLine = CSVReader.ReadLine();
        
        if (CSVLine == null && Record != null)
        {
            RecordNum++;
            
            //End of file - still working on an open value?
            foreach (Capture OpenValue in Record.Groups["OpenValue"].Captures)
                Console.WriteLine("R" + RecordNum + ":ERROR - Open ended quoted value: " + OpenValue.Value);
        }
    }
}

Both examples return the same result of:

R1:V1 =
R1:V2 = record1 value2
R1:V3 = val3
R1:V4 = value 4
R1:V5 = testing "embedded double quotes"
R1:V6 = testing quoted "," character
R1:V7 = value 7
R1:V8 =
R1:V9 = value 9
R1:V10 = testing empty "" embedded quotes
R1:V11 = testing a quoted value

that includes CR/LF patterns

(which we wish would never happen - but it does)
R1:V12 = after CR/LF
ERROR - Open ended quoted value: testing an open ended quoted value

,value 2 ,value 3,

R3:V1 = test

(Note the "ERROR..." line: the open ended quoted value - testing an open ended quoted value - has caused the regex to match that value, and all subsequent values until the properly quoted "test" value, as an error captured in the OpenValue group.)
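If all you need is to detect such records, for example to log them before deciding how to recover, the OpenValue group can be inspected on its own. A minimal sketch; the class and method names are hypothetical, and the pattern is the raw one-line pattern shown earlier:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OpenValueReport
{
    // The raw pattern from this answer, unchanged.
    static readonly Regex CSVParser = new Regex(
        @"(?<=\r|\n|^)(?!\r|\n|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,\r\n]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)");

    // Collects the text of every open-ended quoted value found in csv.
    public static List<string> FindOpenValues(string csv)
    {
        var problems = new List<string>();
        foreach (Match record in CSVParser.Matches(csv))
            foreach (Capture open in record.Groups["OpenValue"].Captures)
                problems.Add(open.Value);
        return problems;
    }

    public static void Main()
    {
        // The second record never closes its quote, so "d" is reported.
        foreach (string p in FindOpenValues("a,\"b\"\r\nc,\"d"))
            Console.WriteLine("Open ended quoted value: " + p);
    }
}
```

An empty result means every quoted value in the input was closed.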


###Key features over other regex solutions I found prior to this:

  • Support for quoted values with embedded/escaped quotes.

  • Support for quoted values that span multiple lines, for example:
    value1,"value 2 line 1
    value 2 line 2",value3

  • Empty values are retained/captured (other than empty lines which aren't explicitly covered in RFC standard 4180 and are assumed to be in error by this regex. This can be changed by removing the second group pattern - (?!\r|\n|$) - from the regex)

  • Lines/records may end with CR+LF or just CR or LF

  • Parses multiple lines/records of a CSV at once returning a match for each record and group(s) for the values within the record (thanks to .NET's ability to capture multiple values into a single named capture group).

  • Keeps the majority of the parsing logic in the regex itself. You shouldn't need to pass CSV to this regex and then check for condition x, y, or z in your code to get the actual values (exceptions highlighted in the limitations below).


###Limitations (workarounds require application logic external to the regex):

  • The record matches cannot be reliably limited by quantifying the value pattern in the regex. That is to say, using something like (<value pattern>){10}(\r\n|\r|\n|$) instead of (<value pattern>)+?(\r\n|\r|\n|$) may limit your line/record matches to only those that contain ten values. But it will also force the pattern to try to match exactly ten values, even if that means splitting one value into two values or capturing nine empty values in the space of one empty value to do so.

  • Escaped/Doubled quote characters are not "unescaped/un-doubled".

  • Records/Lines with open ended quoted values (missing the closing quote) are only supported for debugging purposes. External logic would be required to determine how to better handle this situation by performing additional parsing on the OpenValue capture group.

Since the rules for how to handle this situation are not defined in the RFC standard, this behavior would need to be defined by the application anyway. However, I think the behavior of the regex pattern when this happens is pretty good (captures everything between the open quote and the next valid record as part of the open value).

NOTE: The pattern can be changed to fail earlier (or not at all) and not capture subsequent values (for example by removing the OpenValue capture from the regex). But, in general this causes other bugs to crop up.
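As an illustration of the application logic the unescaping limitation calls for, here is a minimal sketch that wraps the pattern and un-doubles quotes, using the same "is the preceding character a quote?" test as the usage examples. The class name, method name, and return type are assumptions made for this example, and open-ended values are simply ignored:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CsvUnescape
{
    // The raw pattern from this answer, unchanged.
    static readonly Regex CSVParser = new Regex(
        @"(?<=\r|\n|^)(?!\r|\n|$)(?:(?:""(?<Value>(?:[^""]|"""")*)""|(?<Value>(?!"")[^,\r\n]+)|""(?<OpenValue>(?:[^""]|"""")*)(?=\r|\n|$)|(?<Value>))(?:,|(?=\r|\n|$)))+?(?:(?<=,)(?<Value>))?(?:\r\n|\r|\n|$)");

    // Returns one list of values per record, with doubled quotes
    // collapsed inside quoted values.
    public static List<List<string>> ParseRecords(string csv)
    {
        var records = new List<List<string>>();
        foreach (Match record in CSVParser.Matches(csv))
        {
            var values = new List<string>();
            foreach (Capture c in record.Groups["Value"].Captures)
            {
                // A non-empty value is quoted when the character
                // immediately before it in the record is a quote.
                bool quoted = c.Length > 0
                    && c.Index > record.Index
                    && record.Value[c.Index - record.Index - 1] == '"';
                values.Add(quoted ? c.Value.Replace("\"\"", "\"") : c.Value);
            }
            records.Add(values);
        }
        return records;
    }

    public static void Main()
    {
        var rows = ParseRecords("\"test3\"\"test3\",\"xxx3\"\"xxx3\",987\r\n,,\r\n123,123");
        foreach (List<string> row in rows)
            Console.WriteLine(row.Count + " value(s): " + string.Join(" | ", row));
    }
}
```

Because the result is a plain list per record, a caller can also enforce a fixed column count afterwards (for example, rejecting any record whose Count is not the expected number) instead of trying to quantify the value pattern inside the regex.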


###Why?

I'd like to address a common question before it gets asked - "Why did you put the effort into creating this complicated regex pattern instead of using solution X which is faster, better, or whatever?"

I realize there are hundreds of regex answers to this out there, but I couldn't find one that lived up to my high expectations. Most of those expectations are covered by RFC standard 4180 referenced in the question. The ones that mattered most to me were capturing quoted values that span multiple lines and being able to parse multiple lines/records (or the entire CSV content) with the regex if needed, rather than passing in one line at a time.

I also realize most people are abandoning the regex approach for the TextFieldParser or other libraries (such as FileHelpers) to handle CSV parsing. And, that's great - glad it worked for you. I chose not to use those because:

  • (Main reason) I considered it a challenge to do it in regex and I love a good challenge.

  • The TextFieldParser actually falls short of the requirements because it doesn't handle fields that may or may not have quotes within the file. Some CSV files only quote values when needed in order to save space. (It may fall short in other ways, but that one keeps me from even trying it)

  • I don't like depending on third-party libraries for several reasons, but mostly because I can't control their compatibility (i.e. does it work with OS/framework X?), security vulnerabilities, or timely bugfixes and/or maintenance.

answered Oct 25 '22 07:10 by David Woodward