
How do I parse a CSV with commas embedded in quoted fields?

Tags: c++, parsing, csv

I have tried some fixes mentioned in other answers, but they had no effect on my output. I would rather not use Boost.Spirit, as I am not sure it is the best fit for my needs. The similar post also does not deal with quoted fields that contain commas, which is the last issue I need to resolve.

This is a C++ program that takes a CSV file as input. The file lists features of seals, with 23 values (columns) per entry. When I output rawdata[22], I expect to see the last entry of the first set of data. Instead, I see the last entry (Petitioned) followed by the first entry (2055) of the next seal. When I open the file in a hex editor, the two values are separated by a character displayed as "." whose hex value is 0a (a line feed). I have tried \r, \n, and \r\n as delimiters, but none of them work. I cannot use "," as the only delimiter because commas also appear inside quoted strings; I tested it anyway to see whether it would fix this issue, and it did not. How can I separate these values?

OUTPUT:

Petitioned 2055

SAMPLE INPUT:

SpeciesID,Kingdom,Phylum,Class,Order,Family,Genus,Species,Authority,Infraspecific rank,Infraspecific name,Infraspecific authority,Stock/subpopulation,Synonyms,Common names (Eng),Common names (Fre),Common names (Spa),Red List status,Red List criteria,Red List criteria version,Year assessed,Population trend,Petitioned
2055,ANIMALIA,CHORDATA,MAMMALIA,CARNIVORA,OTARIIDAE,Arctocephalus,australis,"(Zimmermann, 1783)",,,,,Arctophoca australis,South American Fur Seal,Otarie fourrure Australe,Oso Marino Austral,LC,,3.1,2016,increasing,N
41664,ANIMALIA,CHORDATA,MAMMALIA,CARNIVORA,OTARIIDAE,Arctocephalus,forsteri,"(Lesson, 1828)",,,,,Arctocephalus australis subspecies forsteri|Arctophoca australis subspecies forsteri,"New Zealand Fur Seal, Antipodean Fur Seal, Australasian Fur Seal, Black Fur Seal, Long-nosed Fur Seal, South Australian Fur Seal",,,LC,,3.1,2015,increasing,N

My code:

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;

int main() {
    string line;
    vector<string> rawdata;
    ifstream file ( "/Users/darla/Desktop/Programs/seals.csv" );
    if ( file.good() )
   {
    while(getline(file, line, '"')) {
        stringstream ss(line);
        while (getline(ss, line, ',')) {
            rawdata.push_back(line);
        }
        if (getline(file, line, '"')) {
            rawdata.push_back(line);
        }
    }
   }
    cout << rawdata[22] << endl;


    return 0;
}
asked Oct 17 '22 at 00:10 by Mr Berry


1 Answer

This is far from a complete CSV parser and could be made more efficient, but it does the job: it parses your file correctly and copes with fields wrapped in double quotes.

#include <iostream>
#include <sstream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>

int main()
{
    std::string line;
    std::vector<std::vector<std::string>> lines;
    std::ifstream file("/Users/darla/Desktop/Programs/seals.csv");

    if (file)
    {
        while (std::getline(file, line))
        {
            size_t n = lines.size();
            lines.resize(n + 1);

            std::istringstream ss(line);
            std::string field, push_field("");
            bool no_quotes = true;

            while (std::getline(ss, field, ',')) 
            {
                if (static_cast<size_t>(std::count(field.begin(), field.end(), '"')) % 2 != 0)
                {
                    no_quotes = !no_quotes;
                }

                push_field += field + (no_quotes ? "" : ",");

                if (no_quotes)
                {
                    lines[n].push_back(push_field);
                    push_field.clear();
                }
            }
        }
    }

    for (const auto& row : lines)
    {
        for (const auto& field : row)
        {
            std::cout << "| " << field << " |";
        }

        std::cout << std::endl << std::endl;
    }

    return 0;
}


An explanation: the program reads the file line by line, splits each line on commas, and stores the results in a vector of vectors. If a piece contains an odd number of double quotes, a quoted field has just been opened, so the following pieces are appended to it (with the comma put back) until the piece holding the closing quote is found; the complete field is then stored. If a piece contains an even number of double quotes, or none, and no quoted field is open, it is stored straight away. Hope this helps.
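For comparison, here is a minimal sketch of the same idea written as a character-by-character scan instead of a split on commas. It is not the code above: split_csv_line is a hypothetical helper name, and the sketch assumes the common CSV convention that a doubled quote ("") inside a quoted field stands for one literal quote. Unlike the program above, it drops the surrounding quote characters from the stored field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer's program): splits one CSV
// record into fields, treating commas inside double quotes as plain text.
std::vector<std::string> split_csv_line(const std::string& line)
{
    std::vector<std::string> fields;
    std::string field;
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i)
    {
        const char c = line[i];

        if (in_quotes)
        {
            if (c == '"')
            {
                if (i + 1 < line.size() && line[i + 1] == '"')
                {
                    field += '"';      // assumed convention: "" means one quote
                    ++i;               // skip the second quote of the pair
                }
                else
                {
                    in_quotes = false; // closing quote, not stored
                }
            }
            else
            {
                field += c;            // commas in here are ordinary characters
            }
        }
        else if (c == '"')
        {
            in_quotes = true;          // opening quote, not stored
        }
        else if (c == ',')
        {
            fields.push_back(field);   // an unquoted comma ends the field
            field.clear();
        }
        else if (c != '\r')
        {
            field += c;                // drop CR outside quotes (CRLF files)
        }
    }

    fields.push_back(field);           // the last field has no trailing comma
    return fields;
}
```

You would call it on each line returned by std::getline(file, line) and push the result into the vector of vectors. Like the program above, it works one line at a time, so a quoted field that itself contains a line break is still not handled.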

answered Nov 15 '22 at 09:11 by Killzone Kid