
How do I match a CSV-style quoted string in nom?

Tags: csv, rust, nom

A CSV style quoted string, for the purposes of this question, is a string in which:

  1. The string starts and ends with exactly one ".
  2. Two double quotes inside the string are collapsed to one double quote: "Alo""ha" becomes Alo"ha.
  3. "" on its own is an empty string.
  4. Malformed inputs, such as "A""" e", must not parse as a single string. That input is A", followed by the junk e".
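For reference, the four rules above can be sketched in plain Rust with no nom involved. This is only an illustration of the intended behaviour, not a nom parser; the name `parse_quoted` and the choice to return the unparsed remainder alongside the result are assumptions made for the example.

```rust
/// Parses one CSV-style quoted string from the start of `input`.
/// Returns the unescaped contents plus whatever follows the closing quote,
/// or `None` if `input` does not begin with a terminated quoted string.
/// (`parse_quoted` is a hypothetical helper, not part of nom.)
fn parse_quoted(input: &str) -> Option<(String, &str)> {
    // Rule 1: the string must open with a double quote.
    let body = input.strip_prefix('"')?;
    let mut out = String::new();
    let mut chars = body.char_indices();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            out.push(c);
            continue;
        }
        // Everything after the quote we just read.
        let rest = &body[i + 1..];
        if rest.starts_with('"') {
            // Rule 2: a doubled quote collapses to a single quote.
            chars.next();
            out.push('"');
        } else {
            // Rules 1 and 3: a lone quote closes the string.
            return Some((out, rest));
        }
    }
    // The input ended before a closing quote was seen.
    None
}
```

With this sketch, `"Alo""ha"` yields Alo"ha with nothing left over, while `"A""" e"` yields A" with ` e"` left over, so a caller can enforce rule 4 by rejecting any unexpected remainder.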

I've tried several things, none of which have worked fully.

The closest I've gotten, thanks to some help from user pinkieval in #nom on the Mozilla IRC:

use std::error as stderror; /* Avoids needing nightly to compile */

named!(csv_style_string<&str, String>, map_res!(
   terminated!(tag!("\""), not!(peek!(char!('"')))),
   csv_string_to_string
));

fn csv_string_to_string(s: &str) -> Result<String, Box<stderror::Error>> {
   Ok(s.to_string().replace("\"\"", "\""))
}

This does not catch the end of the string correctly.

I've also attempted to use the re_match! macro with r#""([^"]|"")*""#, but that always results in an Err::Incomplete(1).

I've determined that the given CSV example for Nom 1.0 doesn't work for a quoted CSV string as I'm describing it, but I do know implementations differ.

Fredrick Brennan asked Jun 07 '18 11:06


1 Answer

Here is one way of doing it:

#[macro_use]
extern crate nom; // nom 4.x: makes named!, delimited!, alt!, etc. available

use nom::types::CompleteStr;

use nom::*;

named!(csv_style_string<CompleteStr, String>,
    delimited!(
        char!('"'),
        map!(
            many0!(
                alt!(
                    // Escaped quote: consume two " and yield one
                    tag!("\"\"") => { |_| '"' }
                |   // Any other character except "
                    none_of!("\"")
                )
            ),
            // Make a string from a vector of chars
            |v| v.iter().collect::<String>()
        ),
        char!('"')
    )
);

fn main() {
    println!(r#""Alo""ha" = {:?}"#, csv_style_string(CompleteStr(r#""Alo""ha""#)));
    println!(r#""" = {:?}"#, csv_style_string(CompleteStr(r#""""#)));
    println!(r#"bad format: {:?}"#, csv_style_string(CompleteStr(r#""A""" e""#)));
}

(I wrote it entirely with nom combinators, but a solution like yours, which post-processes the matched text in an external function instead of applying map!() to each character, would work too and may be more efficient.)

The key here, which would also solve your regexp issue, is to use CompleteStr. It tells nom that nothing will come after that input (otherwise, nom assumes you're writing a streaming parser, so more input may follow).

This is needed because we need to know what to do with a " if it is the last character fed to nom. Depending on what comes after it (another ", a normal character, or EOF), we have to make a different decision. Hence the Incomplete result, which means nom does not have enough input to decide. Telling nom that EOF comes next resolves the ambiguity.
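The ambiguity can be illustrated with a small hand-written parser in plain Rust. This is a sketch of the idea only, not how nom is implemented; `Outcome`, `parse_quoted`, and the `at_eof` flag are hypothetical names, with `at_eof` standing in for the promise that CompleteStr makes.

```rust
/// Possible results of the hypothetical parser below.
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    /// A full string was parsed; the second field is the unparsed remainder.
    Done(String, &'a str),
    /// More input is needed before a decision can be made.
    Incomplete,
    /// The input can never become a valid quoted string.
    Error,
}

/// `at_eof == true` means "nothing will follow this input".
fn parse_quoted(input: &str, at_eof: bool) -> Outcome<'_> {
    let body = match input.strip_prefix('"') {
        Some(body) => body,
        // Empty streaming input might still start with a quote later.
        None if input.is_empty() && !at_eof => return Outcome::Incomplete,
        None => return Outcome::Error,
    };
    let mut out = String::new();
    let mut chars = body.char_indices();
    while let Some((i, c)) = chars.next() {
        if c != '"' {
            out.push(c);
            continue;
        }
        let rest = &body[i + 1..];
        if rest.starts_with('"') {
            // Doubled quote: an escaped ".
            chars.next();
            out.push('"');
        } else if !rest.is_empty() || at_eof {
            // Followed by another character, or by a known EOF: closing quote.
            return Outcome::Done(out, rest);
        } else {
            // Last character seen so far: closing quote or half of ""?
            return Outcome::Incomplete;
        }
    }
    // No closing quote yet: fatal at EOF, otherwise wait for more input.
    if at_eof { Outcome::Error } else { Outcome::Incomplete }
}
```

Feeding `"Alo"` with `at_eof` set to false gives Incomplete, because the final " could still turn out to be the first half of an escaped pair. The same input with `at_eof` set to true gives Done, which is the effect of switching to CompleteStr.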

Further reading on Incomplete on nom's author's blog: http://unhandledexpression.com/general/2018/05/14/nom-4-0-faster-safer-simpler-parsers.html#dealing-with-incomplete-usage


You may note that this parser does not actually reject the invalid input; it parses the beginning and returns the rest. If you use this parser as a subparser in another parser, the latter would then feed the remainder to the next subparser, which would fail (because it expects a comma), causing the overall parser to fail.

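If you don't want that, you could make csv_style_string end with a lookahead such as peek!(alt!(char!(',') | char!('\n') | eof!())), which checks what follows the closing quote without consuming it. Since char! and eof! produce different output types, the branches may need something like value!() to unify them.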
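The lookahead idea can be sketched in plain Rust, independent of nom. The helper names `at_field_end` and `accept` are hypothetical, and the sketch assumes the quoted-string parser hands back its result as an optional (contents, remainder) pair.

```rust
/// Returns true if `rest` (whatever follows the closing quote) begins with a
/// field terminator: a comma, a newline, or the end of input.
/// Nothing is consumed, which mirrors what a peek would do.
fn at_field_end(rest: &str) -> bool {
    rest.is_empty() || rest.starts_with(',') || rest.starts_with('\n')
}

/// Keeps a parse result only if the remainder starts with a terminator,
/// so input such as "A""" e" is rejected instead of partially accepted.
fn accept<'a>(parsed: Option<(String, &'a str)>) -> Option<(String, &'a str)> {
    parsed.filter(|(_, rest)| at_field_end(rest))
}
```

With this check in place, the remainder ` e"` left over from the bad input fails immediately at the quoted string, rather than at whichever subparser runs next.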

Valentin Lorentz answered Nov 17 '22 13:11