
C++ CSV Parsing with Commas Inside of Quotes

Tags: c++, parsing, csv

I'm building a C++ CSV data parser. I'm trying to access the first and fifteenth columns of the file and read them into two arrays using getline commands. For example:

for(int j=0;j<i;j++)
{
    getline(posts2,postIDs[j],',');
    for(int k=0;k<14;k++)
    {
        getline(posts2,tossout,',');
    }
    getline(posts2,answerIDs[j],',');
    getline(posts2,tossout,'\r');
}

But, in-between the first and fifteenth columns is a column that is in quotes and contains various commas and loose quote marks. For example:

...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...

What would the best way to avoid this column be? I can't getline over it because there are quotes and commas inside of it. After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?

Also, I've seen other solutions, but all of them have been exclusive to Windows/Visual Studio. I'm running Mac OSX ver. 10.8.3 with Xcode 3.2.3.

Thanks in advance! Drew

Drew Dielman asked Dec 26 '22 00:12


1 Answer

There is no formal standard for CSV format, but let's note at the outset that the ugly column you have cited:

"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",

does not conform to what are generally taken to be the basic rules of CSV, two of which are:

  • 1) Fields with embedded commas must be quoted.

  • 2) Each of the embedded double-quote characters must be represented by a pair of double-quote characters.

If the problem column obeys rule 1) then it doesn't obey rule 2). But we can construe it so as to obey rule 1) - so we can say where it ends - if we balance the double-quotes as, e.g.

[abc, defghijk. [Lmnopqrs, ]tuv,[] wxyz.],

The balanced outermost quotes enclose the column. The internal quotes carry no marker of being internal other than the balancing itself.

We'd like a rule that will parse this text as one column, consistently with rule 1), and that will also parse columns that do obey rule 2). The balancing just exhibited suggests this can be done, because columns that obey both rules will necessarily be balance-able too.

The suggested rule is:

  • A column runs to the first comma that is preceded by 0 double-quotes or is preceded by the last of an even number of double-quotes.

If the number of double-quotes up to the comma is even, then we know we can balance the enclosing quotes and balance the rest in at least one way.

The simpler rule that you are considering:

After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?

will fail if it meets with certain columns that do obey rule 2), e.g.

"Super, ""luxurious"", truck",

The simpler rule will terminate the column after ""luxurious"". But since this column conforms to rule 2), adjacent double-quotes are "escaped" double-quotes, with no delimiting significance. The suggested rule, on the other hand, still parses the column correctly, terminating it after truck".
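To make the difference concrete, here is a minimal sketch that applies both rules to a string and reports the index of the comma each one treats as the end of the first column. The helper names end_by_simple_rule and end_by_suggested_rule are made up for this illustration, and the sketch assumes the string starts at the opening quote of the column.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: index of the comma that ends the first column under
// the simpler rule, i.e. the first `",` sequence after the opening quote.
// Returns std::string::npos if there is no such sequence.
std::size_t end_by_simple_rule(const std::string& s)
{
    // Search from index 1 so the opening quote is not taken as a terminator.
    std::size_t pos = s.find("\",", 1);
    return pos == std::string::npos ? pos : pos + 1;
}

// Hypothetical helper: index of the comma that ends the first column under
// the suggested rule, i.e. the first comma preceded by 0 quotes or
// immediately preceded by the last of an even number of quotes.
// Returns std::string::npos if there is no such comma.
std::size_t end_by_suggested_rule(const std::string& s)
{
    unsigned quotes = 0;
    char prev = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == '"') {
            ++quotes;
        } else if (s[i] == ',' &&
                   (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            return i;
        }
        prev = s[i];
    }
    return std::string::npos;
}
```

For the text "Super, ""luxurious"", truck",3000.00 the simpler rule stops at index 21 (the comma after ""luxurious""), while the suggested rule stops at index 29 (the comma after truck").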

Here is a demo program in which the function get_csv_column parses columns by the suggested rule:

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>

using namespace std;

/*
    Assume `in` is positioned at start of column.
    Accumulates chars from `in` as long as `in` is good
    until either:-
        - Have consumed a comma preceded by 0 quotes,or
        - Have consumed a comma immediately preceded by
        the last of an even number of quotes.
*/
std::string get_csv_column(ifstream & in)
{
    std::string col;
    unsigned quotes = 0;
    char prev = 0;
    bool finis = false;
    for (int ch; !finis && (ch = in.get()) != EOF; ) {
        switch(ch) {
        case '"':
            ++quotes;
            break;
        case ',':
            if (quotes == 0 || (prev == '"' && (quotes & 1) == 0)) {
                finis = true;
            }
            break;
        default:;
        }
        col += prev = ch;
    }
    return col;
}

int main()
{
    ifstream in("csv.txt");
    if (!in) {
        cout << "Open error :(" << endl;
        exit(EXIT_FAILURE);
    }
    for (std::string col; in; ) {
        col = get_csv_column(in);
        cout << "<[" << col << "]>" << std::endl;
    }
    if (!in && !in.eof()) {
        cout << "Read error :(" << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}

It encloses each column in <[...]>, keeping any newlines as part of the column and including the terminal ',' with each column.

The file csv.txt is:

...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...,
",","",
Year,Make,Model,Description,Price,
1997,Ford,E350,"Super, ""luxurious"", truck",
1997,Ford,E350,"Super, ""luxurious"" truck",
1997,Ford,E350,"ac, abs, moon",3000.00,
1999,Chevy,"Venture ""Extended Edition""","",4900.00,
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00,
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00,

The output is:

<[...,]>
<["abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",]>
<[...,]>
<[
",",]>
<["",]>
<[
Year,]>
<[Make,]>
<[Model,]>
<[Description,]>
<[Price,]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"", truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"" truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["ac, abs, moon",]>
<[3000.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition""",]>
<["",]>
<[4900.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition, Very Large""",]>
<[,]>
<[5000.00,]>
<[
1996,]>
<[Jeep,]>
<[Grand Cherokee,]>
<["MUST SELL!
air, moon roof, loaded",]>
<[4799.00]>
Mike Kinghan answered Jan 08 '23 19:01