I'm building a C++ CSV data parser. I'm trying to read the first and fifteenth columns of the file into two arrays using getline calls. For example:
for(int j=0;j<i;j++)
{
    getline(posts2,postIDs[j],',');
    for(int k=0;k<14;k++)
    {
        getline(posts2,tossout,',');
    }
    getline(posts2,answerIDs[j],',');
    getline(posts2,tossout,'\r');
}
But, in-between the first and fifteenth columns is a column that is in quotes and contains various commas and loose quote marks. For example:
...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...
What would the best way to avoid this column be? I can't getline over it because there are quotes and commas inside of it. After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?
Also, I've seen other solutions, but all of them have been exclusive to Windows/Visual Studio. I'm running Mac OSX ver. 10.8.3 with Xcode 3.2.3.
Thanks in advance! Drew
There is no formal standard for CSV format, but let's note at the outset that the ugly column you have cited:
"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",
does not conform to what are generally regarded as the basic rules of CSV, two of which are:
1) Fields with embedded commas must be quoted.
2) Each of the embedded double-quote characters must be represented by a pair of double-quote characters.
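To make the two rules concrete, here is a minimal sketch of a writer that applies them. The name quote_field is hypothetical, and quoting fields that contain line breaks is an extra assumption beyond the two rules listed above.

```cpp
#include <string>

// Hypothetical helper: encodes `field` so that it obeys rules 1) and 2).
std::string quote_field(const std::string& field)
{
    // Rule 1: only fields with a comma, a quote or (by assumption) a line
    // break need enclosing quotes; anything else is returned unchanged.
    if (field.find_first_of(",\"\r\n") == std::string::npos) {
        return field;
    }
    std::string out = "\"";
    for (std::string::size_type i = 0; i < field.size(); ++i) {
        if (field[i] == '"') {
            out += '"'; // Rule 2: each embedded quote becomes a pair
        }
        out += field[i];
    }
    out += '"';
    return out;
}
```

Passing the text Super, "luxurious", truck to this sketch gives "Super, ""luxurious"", truck", which is a column that obeys both rules.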
If the problem column obeys rule 1), then it does not obey rule 2). We can still construe it as obeying rule 1), and so say where it ends, if we balance its double-quotes. Writing each opening quote as [ and each closing quote as ], one such balancing is:
[abc, defghijk. [Lmnopqrs, ]tuv,[] wxyz.],
The balanced outermost pair encloses the column. The internal quotes carry no other mark of being internal; only the balancing makes them so.
We'd like a rule that parses this text as one column, consistently with rule 1), and that also parses columns that do obey rule 2). The balancing just exhibited suggests this can be done, because columns that obey both rules are necessarily balance-able too.
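The suggested rule is:
A column ends at the first comma that is either preceded by no double-quotes at all, or immediately preceded by a double-quote that brings the number of double-quotes in the column to an even count. With an even count up to the comma, we know we can balance the enclosing quotes and balance the rest in at least one way.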
The simpler rule that you are considering:
After running into a quote, should I read the quoted junk character-by-character until I find ", in sequence?
will fail if it meets with certain columns that do obey rule 2), e.g.
"Super, ""luxurious"", truck",
The simpler rule will terminate the column after ""luxurious"". But since this column conforms to rule 2), adjacent double-quotes are "escaped" double-quotes, with no delimiting significance. The suggested rule, on the other hand, still parses the column correctly, terminating it after truck".
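The difference between the two rules can be checked on a string holding that column. This is only a sketch: both function names are hypothetical, and each returns the first column of its argument including the terminating comma.

```cpp
#include <string>

// Hypothetical helper for the simpler rule: a column that opens with a
// quote ends at the first `",` sequence found after the opening quote.
std::string column_by_simple_rule(const std::string& text)
{
    if (text.empty() || text[0] != '"') {
        std::string::size_type comma = text.find(',');
        return comma == std::string::npos ? text : text.substr(0, comma + 1);
    }
    std::string::size_type end = text.find("\",", 1);
    return end == std::string::npos ? text : text.substr(0, end + 2);
}

// Hypothetical helper for the suggested rule: a column ends at a comma
// preceded by no quotes, or immediately preceded by a quote that makes
// the quote count even.
std::string column_by_suggested_rule(const std::string& text)
{
    unsigned quotes = 0;
    char prev = 0;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        char ch = text[i];
        if (ch == '"') {
            ++quotes;
        } else if (ch == ',' &&
                   (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            return text.substr(0, i + 1);
        }
        prev = ch;
    }
    return text; // no terminating comma found
}
```

Given "Super, ""luxurious"", truck",3000.00, the first function stops at "Super, ""luxurious"", while the second returns the whole quoted column "Super, ""luxurious"", truck",.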
Here is a demo program in which the function get_csv_column parses columns by the suggested rule:
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

/*
    Assume `in` is positioned at start of column.
    Accumulates chars from `in` as long as `in` is good
    until either:-
    - Have consumed a comma preceded by 0 quotes, or
    - Have consumed a comma immediately preceded by
      the last of an even number of quotes.
*/
std::string get_csv_column(ifstream & in)
{
    std::string col;
    unsigned quotes = 0;    // double-quotes seen so far in this column
    char prev = 0;          // previously consumed character
    bool finis = false;
    for (int ch; !finis && (ch = in.get()) != EOF; ) {
        switch(ch) {
        case '"':
            ++quotes;
            break;
        case ',':
            if (quotes == 0 || (prev == '"' && (quotes & 1) == 0)) {
                finis = true;
            }
            break;
        default:;
        }
        // The terminating comma is kept as part of the column.
        prev = static_cast<char>(ch);
        col += prev;
    }
    return col;
}

int main()
{
    ifstream in("csv.txt");
    if (!in) {
        cout << "Open error :(" << endl;
        exit(EXIT_FAILURE);
    }
    for (std::string col; in; ) {
        col = get_csv_column(in);
        cout << "<[" << col << "]>" << std::endl;
    }
    if (!in && !in.eof()) {
        cout << "Read error :(" << endl;
        exit(EXIT_FAILURE);
    }
    exit(EXIT_SUCCESS);
}
It encloses each column in <[...]>, not discounting newlines, and including the terminal ',' with each column.
The file csv.txt is:
...,"abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",...,
",","",
Year,Make,Model,Description,Price,
1997,Ford,E350,"Super, ""luxurious"", truck",
1997,Ford,E350,"Super, ""luxurious"" truck",
1997,Ford,E350,"ac, abs, moon",3000.00,
1999,Chevy,"Venture ""Extended Edition""","",4900.00,
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00,
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00,
The output is:
<[...,]>
<["abc, defghijk. "Lmnopqrs, "tuv,"" wxyz.",]>
<[...,]>
<[
",",]>
<["",]>
<[
Year,]>
<[Make,]>
<[Model,]>
<[Description,]>
<[Price,]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"", truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["Super, ""luxurious"" truck",]>
<[
1997,]>
<[Ford,]>
<[E350,]>
<["ac, abs, moon",]>
<[3000.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition""",]>
<["",]>
<[4900.00,]>
<[
1999,]>
<[Chevy,]>
<["Venture ""Extended Edition, Very Large""",]>
<[,]>
<[5000.00,]>
<[
1996,]>
<[Jeep,]>
<[Grand Cherokee,]>
<["MUST SELL!
air, moon roof, loaded",]>
<[4799.00]>
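For the original task of picking out the first and fifteenth columns, the same rule can be applied to one record at a time. The sketch below is only an illustration under stated assumptions: split_record is a hypothetical name, each record is assumed to sit on a single line (no line break inside a quoted column), terminating commas are dropped, and enclosing quotes are kept.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: splits one record into columns by the suggested
// rule. Assumes the record holds no line break inside a quoted column.
std::vector<std::string> split_record(const std::string& record)
{
    std::vector<std::string> columns;
    std::string col;
    unsigned quotes = 0;    // double-quotes seen in the current column
    char prev = 0;          // previous character of the current column
    for (std::string::size_type i = 0; i < record.size(); ++i) {
        char ch = record[i];
        if (ch == '"') {
            ++quotes;
        } else if (ch == ',' &&
                   (quotes == 0 || (prev == '"' && quotes % 2 == 0))) {
            columns.push_back(col); // comma ends the column and is dropped
            col.clear();
            quotes = 0;
            prev = 0;
            continue;
        }
        col += ch;
        prev = ch;
    }
    columns.push_back(col); // final column; empty if the record ends in ','
    return columns;
}
```

With each record read into a string (for instance by std::getline with whatever line terminator the file uses), the first and fifteenth columns would be columns[0] and columns[14], after checking that columns.size() is at least 15.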