Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to parse csv using boost::spirit

I have this csv line

std::string s = R"(1997,Ford,E350,"ac, abs, moon","some "rusty" parts",3000.00)";

I can parse it using boost::tokenizer:

typedef boost::tokenizer< boost::escaped_list_separator<char> , std::string::const_iterator, std::string> Tokenizer;
boost::escaped_list_separator<char> seps('\\', ',', '\"');
Tokenizer tok(s, seps);
for (auto i : tok)
{
    std::cout << i << std::endl;
}

It gets it right except token "rusty" should have double quotes which are getting stripped.

Here is my attempt to use boost::spirit

boost::spirit::classic::rule<> list_csv_item = !(boost::spirit::classic::confix_p('\"', *boost::spirit::classic::c_escape_ch_p, '\"') | boost::spirit::classic::longest_d[boost::spirit::classic::real_p | boost::spirit::classic::int_p]);
std::vector<std::string> vec_item;
std::vector<std::string>  vec_list;
boost::spirit::classic::rule<> list_csv = boost::spirit::classic::list_p(list_csv_item[boost::spirit::classic::push_back_a(vec_item)],',')[boost::spirit::classic::push_back_a(vec_list)];
boost::spirit::classic::parse_info<> result = parse(s.c_str(), list_csv);
if (result.hit)
{
  for (auto i : vec_item)
  {
    cout << i << endl;
   }
}

Problems:

  1. does not work, prints the first token only

  2. why boost::spirit::classic? can't find examples using Spirit V2

  3. the setup is brutal .. but I can live with this

** I really want to use boost::spirit because it tends to be pretty fast

Expected output:

1997
Ford
E350
ac, abs, moon
some "rusty" parts

3000.00

like image 826
user841550 Avatar asked Aug 21 '13 18:08

user841550


2 Answers

For a background on parsing (optionally) quoted delimited fields, including different quoting characters (', "), see here:

  • Parse quoted strings with boost::spirit

For a very, very, very complete example complete with support for partially quoted values and a

splitInto(input, output, ' ');

method that takes 'arbitrary' output containers and delimiter expressions, see here:

  • How to make my split work only on one real line and be capable to skip quoted parts of string?

Addressing your exact question, assuming either quoted or unquoted fields (no partial quotes inside field values), using Spirit V2:

Let's take the simplest 'abstract datatype' that could possibly work:

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvLine = Columns;
using CsvFile = std::vector<CsvLine>;

And the repeated double-quote escapes a double-quote semantics (as I pointed out in the comment), you should be able to use something like:

static const char colsep = ',';

start  = -line % eol;
line   = column % colsep;
column = quoted | *~char_(colsep);
quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

The following complete test program prints

[1997][Ford][E350][ac, abs, moon][rusty][3001.00]

(Note the BOOST_SPIRIT_DEBUG define for easy debugging). See it Live on Coliru

Full Demo

//#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

using Column  = std::string;
using Columns = std::vector<Column>;
using CsvLine = Columns;
using CsvFile = std::vector<CsvLine>;

template <typename It>
struct CsvGrammar : qi::grammar<It, CsvFile(), qi::blank_type>
{
    CsvGrammar() : CsvGrammar::base_type(start)
    {
        using namespace qi;

        static const char colsep = ',';

        start  = -line % eol;
        line   = column % colsep;
        column = quoted | *~char_(colsep);
        quoted = '"' >> *("\"\"" | ~char_('"')) >> '"';

        BOOST_SPIRIT_DEBUG_NODES((start)(line)(column)(quoted));
    }
  private:
    qi::rule<It, CsvFile(), qi::blank_type> start;
    qi::rule<It, CsvLine(), qi::blank_type> line;
    qi::rule<It, Column(),  qi::blank_type> column;
    qi::rule<It, std::string()> quoted;
};

int main()
{
    const std::string s = R"(1997,Ford,E350,"ac, abs, moon","""rusty""",3001.00)";

    auto f(begin(s)), l(end(s));
    CsvGrammar<std::string::const_iterator> p;

    CsvFile parsed;
    bool ok = qi::phrase_parse(f,l,p,qi::blank,parsed);

    if (ok)
    {
        for(auto& line : parsed) {
            for(auto& col : line)
                std::cout << '[' << col << ']';
            std::cout << std::endl;
        }
    } else
    {
        std::cout << "Parse failed\n";
    }

    if (f!=l)
        std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}
like image 69
sehe Avatar answered Nov 14 '22 22:11

sehe


Sehe's post looks a fair bit cleaner than mine, but I was putting this together for a bit, so here it is anyways:

#include <boost/tokenizer.hpp>
#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

int main() {
    const std::string s = R"(1997,Ford,E350,"ac, abs, moon",""rusty"",3000.00)";

    // Tokenizer
    typedef boost::tokenizer< boost::escaped_list_separator<char> , std::string::const_iterator, std::string> Tokenizer;
    boost::escaped_list_separator<char> seps('\\', ',', '\"');
    Tokenizer tok(s, seps);
    for (auto i : tok)
        std::cout << i << "\n";
    std::cout << "\n";

    // Boost Spirit Qi
    qi::rule<std::string::const_iterator, std::string()> quoted_string = '"' >> *(qi::char_ - '"') >> '"';
    qi::rule<std::string::const_iterator, std::string()> valid_characters = qi::char_ - '"' - ',';
    qi::rule<std::string::const_iterator, std::string()> item = *(quoted_string | valid_characters );
    qi::rule<std::string::const_iterator, std::vector<std::string>()> csv_parser = item % ',';

    std::string::const_iterator s_begin = s.begin();
    std::string::const_iterator s_end = s.end();
    std::vector<std::string> result;

    bool r = boost::spirit::qi::parse(s_begin, s_end, csv_parser, result);
    assert(r == true);
    assert(s_begin == s_end);

    for (auto i : result)
        std::cout << i << std::endl;
    std::cout << "\n";
}   

And this outputs:

1997
Ford
E350
ac, abs, moon
rusty
3000.00

1997
Ford
E350
ac, abs, moon
rusty
3000.00

Something Worth Noting: This doesn't implement a full CSV parser. You'd also want to look into escape characters or whatever else is required for your implementation.

Also: If you're looking into the documentation, just so you know, in Qi, 'a' is equivalent to boost::spirit::qi::lit('a') and "abc" is equivalent to boost::spirit::qi::lit("abc").

On Double quotes: So, as Sehe notes in a comment above, it's not directly clear what the rules surrounding a "" in the input text means. If you wanted all instances of "" not within a quoted string to be converted to a ", then something like the following would work.

qi::rule<std::string::const_iterator, std::string()> double_quote_char = "\"\"" >> qi::attr('"');
qi::rule<std::string::const_iterator, std::string()> item = *(double_quote_char | quoted_string | valid_characters );
like image 38
Bill Lynch Avatar answered Nov 15 '22 00:11

Bill Lynch