
How to use PARSE dialect to read in a line from a CSV?

Tags:

parsing

csv

rebol

I'm trying to use PARSE to turn a CSV line into a Rebol block. Easy enough to write in open code, but as with other questions I am trying to learn what the dialect can do without that.

So if a line says:

"Look, that's ""MR. Fork"" to you!",Hostile Fork,,http://hostilefork.com

Then I want the block:

[{Look, that's "MR. Fork" to you!} {Hostile Fork} none {http://hostilefork.com}]

Issues to notice:

  • Embedded quotes in CSV strings are indicated with ""
  • Commas can be inside quotes and hence part of the literal, not a column separator
  • Adjacent column-separating commas indicate an empty field
  • Strings that don't contain quotes or commas can appear without quotes
  • For the moment we can keep things like http://rebol.com as STRING! instead of LOADing them into types such as URL!

To make it more uniform, the first thing I do is append a comma to the input line. Then I have a column-rule which captures a single column terminated by a comma...which may either be in quotes or not.

I know how many columns there should be due to the header line, so the code then says:

unless parse line compose [(column-count) column-rule] [
    print rejoin [{Expected } column-count { columns.}]
]

But I'm a bit stuck on writing column-rule. I need a way in the dialect to express "Once you find a quote, keep skipping quote pairs until you find a quote standing all on its own." What's a good way to do that?

asked Nov 19 '12 09:11 by HostileFork says dont trust SE


1 Answer

As with most parse problems, I try to build a grammar that best describes the elements of the input format.

In this case, we have nouns:

[comma ending value-chars qmark quoted-chars value header row]

Some verbs:

[row-feed emit-value]

And the operative nouns:

[current chunk current-row width]

I suppose I could break it down a little more, but this is enough to work with. First, the foundation:

comma: ","
ending: "^/"
qmark: {"}
value-chars: complement charset reduce [qmark comma ending]
quoted-chars: complement charset reduce [qmark]
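
As a quick sanity check on these character sets, the sketch below probes value-chars with a few sample characters. The loop and the sample characters are only illustrative and are not part of the parser itself; letters and spaces should test as members, while the quote, comma, and newline should not.

```rebol
REBOL []

comma: ","
ending: "^/"
qmark: {"}

; Everything except a quote, a comma, or a newline may appear in an unquoted value
value-chars: complement charset reduce [qmark comma ending]

; FIND on a bitset yields TRUE or NONE; FOUND? turns that into a logic value
foreach ch [#"a" #" " #"," #"^"" #"^/"] [
    print [mold ch "in value-chars?" found? find value-chars ch]
]
```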

Now the value structure. Quoted values are built up from chunks of valid chars or quotes as we find them:

current: chunk: none
quoted-value: [
    qmark (current: copy "")
    any [
        copy chunk some quoted-chars (append current chunk)
        |
        qmark qmark (append current qmark)
    ]
    qmark
]

value: [
    copy current some value-chars
    | quoted-value
]

emit-value: [
    (
        delimiter: comma
        append current-row current
    )
]

emit-none: [
    (
        delimiter: comma
        append current-row none
    )
]

Note that delimiter is set to ending at the beginning of each row, then changed to comma as soon as we pass a value. Thus, an input row is defined as [ending value any [comma value]].
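
To see how the quoted-value rule handles doubled quotes in isolation, here is a minimal sketch that runs it against a few hand-written samples. The sample strings and the reporting loop are assumptions added for illustration; the rule itself is copied from above. The last sample has no closing quote, so it should be rejected.

```rebol
REBOL []

qmark: {"}
quoted-chars: complement charset reduce [qmark]

current: chunk: none
quoted-value: [
    qmark (current: copy "")
    any [
        ; a run of ordinary characters
        copy chunk some quoted-chars (append current chunk)
        |
        ; a doubled quote stands for one literal quote
        qmark qmark (append current qmark)
    ]
    ; a quote standing on its own closes the value
    qmark
]

foreach sample [
    {"plain"}
    {"a, b"}
    {"that's ""MR. Fork"" to you!"}
    {"unterminated}
][
    either parse/all sample quoted-value [
        print ["matched:" mold current]
    ][
        print ["no match:" mold sample]
    ]
]
```

The closing quote is recognized without any lookahead: when the second qmark of the pair fails to match, the alternative backtracks, the ANY loop ends, and the final qmark consumes the lone quote.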

All that remains is to define the document structure:

current-row: none
row-feed: [
    (
        delimiter: ending
        append/only out current-row: copy []
    )
]

width: none
header: [
    (out: copy [])
    row-feed any [
        value comma
        emit-value
    ]
    value body: ending :body
    emit-value
    (width: length? current-row)
]

row: [
    row-feed width [
        delimiter [
            value emit-value
            | emit-none
        ]
    ]
]

if parse/all stream [header some row opt ending][out]

Wrap it up to shield all those words, and you have:

REBOL [
    Title: "CSV Parser"
    Date: 19-Nov-2012
    Author: "Christopher Ross-Gill"
]

parse-csv: use [
    comma ending delimiter value-chars qmark quoted-chars
    value quoted-value header row
    row-feed emit-value emit-none
    out current current-row width body
][
    comma: ","
    ending: "^/"
    qmark: {"}
    value-chars: complement charset reduce [qmark comma ending]
    quoted-chars: complement charset reduce [qmark]

    current: none
    quoted-value: use [chunk][
        [
            qmark (current: copy "")
            any [
                copy chunk some quoted-chars (append current chunk)
                |
                qmark qmark (append current qmark)
            ]
            qmark
        ]
    ]

    value: [
        copy current some value-chars
        | quoted-value
    ]

    current-row: none
    row-feed: [
        (
            delimiter: ending
            append/only out current-row: copy []
        )
    ]
    emit-value: [
        (
            delimiter: comma
            append current-row current
        )
    ]
    emit-none: [
        (
            delimiter: comma
            append current-row none
        )
    ]

    width: none
    header: [
        (out: copy [])
        row-feed any [
            value comma
            emit-value
        ]
        value body: ending :body
        emit-value
        (width: length? current-row)
    ]

    row: [
        opt ending end break
        |
        row-feed width [
            delimiter [
                value emit-value
                | emit-none
            ]
        ]
    ]

    func [stream [string!]][
        if parse/all stream [header some row][out]
    ]
]
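
A short usage sketch, assuming the script above has been saved as %parse-csv.r (a hypothetical file name) in the current directory. Because the grammar reads the first line as a header and uses it to fix the column count, the line from the question is placed under an invented four-column header.

```rebol
REBOL []

; Assumption: %parse-csv.r is a hypothetical file holding the script above
do %parse-csv.r

; The header names are made up; the second line is the example from the question
csv: {quote,name,blank,url
"Look, that's ""MR. Fork"" to you!",Hostile Fork,,http://hostilefork.com
}

result: parse-csv csv

; parse-csv returns NONE when the input does not match the grammar
either result [
    foreach row result [probe row]
][
    print "Input did not match the CSV grammar."
]
```

If the grammar behaves as described, the second row should come back as four items, with none in the third position for the empty field.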

answered Oct 21 '22 04:10 by rgchris