
Regular expression to match CSV delimiters

Tags: regex

I'm trying to create a PCRE that will match only the commas used as delimiters in a line from a CSV file. Assuming the format of a line is this:

1,"abcd",2,"de,fg",3,"hijk"

I want to match all of the commas except for the one between the 'e' and 'f'. Alternatively, matching just that one is acceptable, if that is the easier or more sensible solution. I have the sense that I need to use a negative lookahead assertion to handle this, but I'm finding it a bit too difficult to figure out.
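To make the goal concrete, here is a minimal sketch of the lookahead idea in Python (its `re` module is not PCRE, but it supports the same lookahead syntax). It assumes every quoted field is properly closed and that a record sits on one line; the names `DELIMITER` and `split_on_delimiters` are made up for this illustration. A comma counts as a delimiter only when an even number of double quotes follows it:

```python
import re

# A comma is a delimiter if the rest of the line contains an even number of
# double quotes, i.e. the comma is not inside an open quoted field.
# Assumption: quotes are balanced (well-formed CSV, one record per line).
DELIMITER = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

def split_on_delimiters(line):
    """Split a CSV line on delimiter commas only; quotes are left in place."""
    return DELIMITER.split(line)

line = '1,"abcd",2,"de,fg",3,"hijk"'
# Offsets of the commas that act as delimiters (the one in "de,fg" is skipped).
print([m.start() for m in DELIMITER.finditer(line)])
print(split_on_delimiters(line))
```

The lookahead rescans the rest of the line for every comma, so the cost grows quadratically with line length, and the approach gives wrong answers as soon as the quotes are unbalanced.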

Kespan asked Jun 21 '11 21:06


3 Answers

See my post that solves this problem for more detail.

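^(?:(?:"((?:""|[^"])+)"|([^,]*))(?:$|,))+$

This will match the whole line; you can then use match.Groups[1].Captures to get your data out (without the quotes). Note that group 1 only collects the quoted fields; unquoted fields end up in match.Groups[2].Captures. Also, I let "My name is ""in quotes""" be a valid string.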
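For engines that do not keep every capture of a repeated group (Python's `re` retains only the last one), the same per-field pattern can be applied one field at a time. The following is a rough sketch, not the answerer's code: `FIELD` and `split_fields` are hypothetical names, the quoted alternative uses `*` instead of `+` so that `""` counts as an empty quoted field, and the line is assumed to carry no trailing newline.

```python
import re

# One field followed by its terminator (a comma, or the end of the line).
# Group 1: contents of a quoted field; group 2: an unquoted field;
# group 3: the terminator, which is empty on the last field.
FIELD = re.compile(r'(?:"((?:""|[^"])*)"|([^,]*))(,|$)')

def split_fields(line):
    """Return the fields of one CSV line; doubled quotes are left as-is."""
    fields = []
    pos = 0
    while True:
        # The unquoted alternative can match the empty string, so a match
        # always exists at any position, including the end of the line.
        m = FIELD.match(line, pos)
        quoted, plain, terminator = m.groups()
        fields.append(quoted if quoted is not None else plain)
        if terminator == '':
            return fields
        pos = m.end()

print(split_fields('1,"abcd",2,"de,fg",3,"hijk"'))
```

Stopping when the terminator is empty avoids picking up a spurious empty field at the end of the line, while a genuine trailing comma still yields a final empty field.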

agent-j answered Oct 08 '22 16:10


CSV parsing is a difficult problem, and has been well-solved. Whatever language you are using doubtless has a complete solution that takes care of it, without you having to go down the road of writing your own regex.

What language are you using?
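As one illustration of such a ready-made solution, assuming Python happens to be the language in use, the standard library's `csv` module handles the sample line from the question directly. The helper name `parse_line` is invented for this sketch.

```python
import csv

def parse_line(line):
    """Parse a single CSV record with the standard library's csv module."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line]))

# The embedded comma in "de,fg" stays inside its field.
print(parse_line('1,"abcd",2,"de,fg",3,"hijk"'))
# Doubled quotes inside a quoted field are collapsed to a single quote.
print(parse_line('"say ""hi""",x'))
```

Every field comes back as a string; converting `1`, `2` and `3` to numbers is left to the caller.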

Andy Lester answered Oct 08 '22 16:10


As you've already been told, a regular expression is really not appropriate; it is tricky to deal with the general case (doubly so if newlines are allowed in fields, and triply so if you might have to deal with malformed CSV data).

  • I suggest the tool CSVFIX as likely to do what you need.

To see how bad CSV can be, consider this data (with 5 clean fields, two of them empty):

"""",,"",a,"a,b"

Note that the first field contains just one double quote. Getting the two double quotes squished to one is really rather tough; you probably have to do it with a second pass after you've captured both with the regex. And consider this ill-formed data too:

"",,"",a",b c",

The problem there is that the field that starts with a contains a double quote; how to interpret it? Stop at the comma? Then the field that starts with b is similarly ill-formed. Stop at the next quote? So the field is a",b c" (or should the quotes be removed)? Etc...yuck!

This Perl gets pretty close to handling correctly both the above lines of data with a ghastly regex:

use strict;
use warnings;

my @list = ( q{"""",,"",a,"a,b"}, q{"",,"",a",b c",} );

foreach my $string (@list)
{
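    print "Pattern:  <<$string>>\n";
    # Each match consumes one field plus its terminator (a comma or end of line):
    #   $1 = quoted field (QF): text between the enclosing double quotes
    #   $2 = plain field (PF): starts with neither a comma nor a double quote
    #   $3 = empty field (EF): whatever is left, normally nothing
    while ($string =~ m/ (?: " ( (?:""|[^"])* ) "  |  ( [^,"] [^,]* )  |  ( .? ) )
                         (?: $ | , ) /gx)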
    {
        print "Found QF: <<$1>>\n" if defined $1;
        print "Found PF: <<$2>>\n" if defined $2;
        print "Found EF: <<$3>>\n" if defined $3;
    }
}

Note that as written, you have to identify which of the three captures was actually used. With two-stage processing, you could just deal with one capture and then strip out the enclosing double quotes and collapse any nested doubled-up double quotes. This regex assumes that if the field does not start with a double quote, then a double quote has no special meaning within the field. Have fun ringing the changes!

Output:

Pattern:  <<"""",,"",a,"a,b">>
Found QF: <<"">>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a>>
Found QF: <<a,b>>
Found EF: <<>>
Pattern:  <<"",,"",a",b c",>>
Found QF: <<>>
Found EF: <<>>
Found QF: <<>>
Found PF: <<a">>
Found PF: <<b c">>
Found EF: <<>>

We can debate whether the empty field (EF) at the end of the first pattern is correct; it probably isn't, which is why I said 'pretty close'. OTOH, the EF at the end of the second pattern is correct. Also, the extraction of two double quotes from the field """" is not the final result you want; you'd have to post-process the field to eliminate one of each adjacent pair of double quotes.
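That post-processing pass could look roughly like the sketch below, written in Python rather than Perl for brevity. The function name `unquote` is hypothetical, and it assumes the raw field still carries its enclosing double quotes (the single-capture, two-stage variant); a field that is not wrapped in quotes is returned unchanged, matching the assumption that quotes have no special meaning there.

```python
def unquote(raw):
    """Second pass: turn a raw CSV field into its value."""
    # Only a field wrapped in double quotes gets special treatment.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        # Drop the enclosing quotes, then collapse each "" pair into one ".
        return raw[1:-1].replace('""', '"')
    return raw

for raw in ['""""', '""', 'a', '"a,b"', 'a"']:
    print(repr(raw), '->', repr(unquote(raw)))
```

`str.replace` scans left to right without overlapping, so a run of four quotes inside a field collapses to two, which is the intended result.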

Jonathan Leffler answered Oct 08 '22 15:10