Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to split a CSV

Description

Instead of using a split, I think it would be easier to simply execute a match and process all the found matches.

This expression will:

  • divide your sample text on the comma delimits
  • will process empty values
  • will ignore double quoted commas, providing double quotes are not nested
  • trims the delimiting comma from the returned value
  • trims surrounding quotes from the returned value

Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)

enter image description here

Example

Sample Text

123,2.99,AMO024,Title,"Description, more info",,123987564

ASP example using the non-java expression

Set regEx = New RegExp
regEx.Global = True
regEx.IgnoreCase = True
regEx.MultiLine = True
sourcestring = "your source string"
regEx.Pattern = "(?:^|,)(?=[^""]|("")?)""?((?(1)[^""]*|[^,""]*))""?(?=,|$)"
Set Matches = regEx.Execute(sourcestring)
  For z = 0 to Matches.Count-1
    results = results & "Matches(" & z & ") = " & chr(34) & Server.HTMLEncode(Matches(z)) & chr(34) & chr(13)
    For zz = 0 to Matches(z).SubMatches.Count-1
      results = results & "Matches(" & z & ").SubMatches(" & zz & ") = " & chr(34) & Server.HTMLEncode(Matches(z).SubMatches(zz)) & chr(34) & chr(13)
    next
    results=Left(results,Len(results)-1) & chr(13)
  next
Response.Write "<pre>" & results

Matches using the non-java expression

Group 0 gets the entire substring which includes the comma
Group 1 gets the quote if it's used
Group 2 gets the value not including the comma

[0][0] = 123
[0][1] = 
[0][2] = 123

[1][0] = ,2.99
[1][1] = 
[1][2] = 2.99

[2][0] = ,AMO024
[2][1] = 
[2][2] = AMO024

[3][0] = ,Title
[3][1] = 
[3][2] = Title

[4][0] = ,"Description, more info"
[4][1] = "
[4][2] = Description, more info

[5][0] = ,
[5][1] = 
[5][2] = 

[6][0] = ,123987564
[6][1] = 
[6][2] = 123987564

Worked on this for a bit and came up with this solution:

(?:,|\n|^)("(?:(?:"")*[^"]*)*"|[^",\n]*|(?:\n|$))

Try it out here!

This solution handles "nice" CSV data like

"a","b",c,"d",e,f,,"g"

0: "a"
1: "b"
2: c
3: "d"
4: e
5: f
6:
7: "g"

and uglier things like

"""test"" one",test' two,"""test"" 'three'","""test 'four'"""

0: """test"" one"
1: test' two
2: """test"" 'three'"
3: """test 'four'"""

Here's an explanation of how it works:

(?:,|\n|^)      # all values must start at the beginning of the file,  
                #   the end of the previous line, or at a comma  
(               # single capture group for ease of use; CSV can be either...  
  "             # ...(A) a double quoted string, beginning with a double quote (")  
    (?:         #        character, containing any number (0+) of  
      (?:"")*   #          escaped double quotes (""), or  
      [^"]*     #          non-double quote characters  
    )*          #        in any order and any number of times  
  "             #        and ending with a double quote character  

  |             # ...or (B) a non-quoted value  

  [^",\n]*      # containing any number of characters which are not  
                # double quotes ("), commas (,), or newlines (\n)  

  |             # ...or (C) a single newline or end-of-file character,  
                #           used to capture empty values at the end of  
  (?:\n|$)      #           the file or at the ends of lines  
)

I created this a few months ago for a project.

 ".+?"|[^"]+?(?=,)|(?<=,)[^"]+

Regular expression visualization

It works in C# and the Debuggex was happy when I selected Python and PCRE. Javascript doesn't recognize this form of Proceeded By ?<=....

For your values, it will create matches on

123
,2.99
,AMO024
,Title
"Description, more info"
,
,123987564

Note that anything in quotes doesn't have a leading comma, but attempting to match with a leading comma was required for the empty value use case. Once done, trim values as necessary.

I use RegexHero.Net to test my Regex.


I'm late to the party, but the following is the Regular Expression I use:

(?:,"|^")(""|[\w\W]*?)(?=",|"$)|(?:,(?!")|^(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

This pattern has three capturing groups:

  1. Contents of a quoted cell
  2. Contents of an unquoted cell
  3. A new line

This pattern handles all of the following:

  • Normal cell contents without any special features: one,2,three
  • Cell containing a double quote (" is escaped to ""): no quote,"a ""quoted"" thing",end
  • Cell contains a newline character: one,two\nthree,four
  • Normal cell contents which have an internal quote: one,two"three,four
  • Cell contains quotation mark followed by comma: one,"two ""three"", four",five

See this pattern in use.

If you have are using a more capable flavor of regex with named groups and lookbehinds, I prefer the following:

(?<quoted>(?<=,"|^")(?:""|[\w\W]*?)*(?=",|"$))|(?<normal>(?<=,(?!")|^(?!"))[^,]*?(?=(?<!")$|(?<!"),))|(?<eol>\r\n|\n)

See this pattern in use.

Edit

(?:^"|,")(""|[\w\W]*?)(?=",|"$)|(?:^(?!")|,(?!"))([^,]*?)(?=$|,)|(\r\n|\n)

This slightly modified pattern handles lines where the first column is empty as long as you are not using Javascript. For some reason Javascript will omit the second column with this pattern. I was unable to correctly handle this edge-case.


I needed this answer too, but I found the answers, while informative, a little hard to follow and replicate for other languages. Here is the simplest expression I came up with for a single column out of the CSV line. I am not splitting. I'm building a regex to match a column out of the CSV so I'm not splitting the line:

("([^"]*)"|[^,]*)(,|$)

This matches a single column from the CSV line. The first portion "([^"]*)" of the expression is to match a quoted entry, the second part [^,]* is to match a non-quoted entry. Then either followed by a , or end of line $.

And the accompanying debuggex to test out the expression.

https://www.debuggex.com/r/s4z_Qi2gZiyzpAhx


I personally tried many RegEx expressions without having found the perfect one that match all cases.

I think that regular expressions is hard to configure properly to match all cases properly. Although few persons will not like the namespace (and I was part of them), I propose something that is part of the .Net framework and give me proper results all the times in all cases (mainly managing every double quotes cases very well):

Microsoft.VisualBasic.FileIO.TextFieldParser

Found it here: StackOverflow

Example of usage:

TextReader textReader = new StringReader(simBaseCaseScenario.GetSimStudy().Study.FilesToDeleteWhenComplete);
Microsoft.VisualBasic.FileIO.TextFieldParser textFieldParser = new TextFieldParser(textReader);
textFieldParser.SetDelimiters(new string[] { ";" });
string[] fields = textFieldParser.ReadFields();
foreach (string path in fields)
{
    ...

Hope it could help.