I'm processing data from government sources (FEC, state voter databases, etc). It's inconsistently malformed, which breaks my CSV parser in all sorts of delightful ways.
It's externally sourced and authoritative. I must parse it, and I cannot have it re-input, validated on input, or the like. It is what it is; I don't control the input.
Properties:
Invalid UTF-8 bytes appear in the text, e.g. Foo \xAB bar
Fields may be quoted ("foo",123,"bar") or unquoted (foo,123,bar). I haven't yet encountered any where it's mixed within a given line (i.e. "foo",123,bar) but it's probably in there.

I'm using Ruby FasterCSV (known as just CSV in 1.9), but the question should be language-agnostic.
My guess is that a solution will require a preprocessing pass that substitutes unambiguous record separator / quote characters (e.g. ASCII RS, STX). I've started a bit here but it doesn't work for everything I get.
How can I process this kind of dirty data robustly?
ETA: Here's a simplified example of what may be in a single file:
"this","is",123,"a","normal","line"
"line","with "an" internal","quote"
"short line","with an "internal quote", 1 comma and linebreaks"
un "quot" ed,text,with,1,2,3,numbers
"quoted","number","series","1,2,3"
"invalid \xAB utf-8"
It is possible to subclass Ruby's File to process each line of the CSV file before it is passed to Ruby's CSV parser. For example, here's how I used this trick to replace non-standard backslash-escaped quotes \" with standard doubled quotes "":
require 'csv'

# File subclass that repairs each line before the CSV parser sees it.
class MyFile < File
  def gets(*args)
    line = super
    # Replace \" with "" so it no longer causes a parse error.
    line.gsub!('\\"', '""') unless line.nil?
    line
  end
end

infile = MyFile.open(filename) # filename: path to the input file
incsv = CSV.new(infile)
while row = incsv.shift
  # process each row here
end
You could in principle do all sorts of additional processing, e.g. UTF-8 cleanups. The nice thing about this approach is you handle the file on a line by line basis, so you don't need to load it all into memory or create an intermediate file.
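As one example of such a cleanup, here is a minimal sketch of a UTF-8 scrubbing step. The helper name clean_line and the '?' replacement character are my own choices for illustration, and it assumes the data is mostly UTF-8 with the occasional stray byte (and Ruby 2.1 or later, for String#scrub):

```ruby
# Hypothetical helper (not part of any library): make a raw line safe to
# hand to a CSV parser. Assumes the input is mostly UTF-8 with stray bytes.
def clean_line(raw, replacement = '?')
  # Work on a copy so the caller's string is left untouched.
  line = raw.dup.force_encoding(Encoding::UTF_8)
  # String#scrub replaces every invalid byte sequence with the replacement.
  line.scrub(replacement)
end

dirty = "invalid \xAB utf-8".b # binary string containing a stray 0xAB byte
puts clean_line(dirty)         # prints: invalid ? utf-8
```

Something like this could be called from inside the overridden gets, right next to the gsub!, if replacing bad bytes is acceptable for your data.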
First, here is a rather naive attempt: http://rubular.com/r/gvh3BJaNTc
/"(.*?)"(?=[\r\n,]|$)|([^,"\s].*?)(?=[\r\n,]|$)/m
The assumptions here are:
This almost does what you want, but fails on these fields:
1 comma and linebreaks"
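To see the behaviour outside Rubular, here is a small sketch that runs the same regex with String#scan on three of the sample lines from the question. The constant FIELD_RE and the helper naive_fields are names I made up for this illustration:

```ruby
# The regex from above, unchanged.
FIELD_RE = /"(.*?)"(?=[\r\n,]|$)|([^,"\s].*?)(?=[\r\n,]|$)/m

# Hypothetical helper: each match fills exactly one capture group,
# group 1 for a quoted field and group 2 for an unquoted one.
def naive_fields(text)
  text.scan(FIELD_RE).map { |quoted, unquoted| quoted || unquoted }
end

p naive_fields('"this","is",123,"a","normal","line"')
# => ["this", "is", "123", "a", "normal", "line"]
p naive_fields('"line","with "an" internal","quote"')
# => ["line", "with \"an\" internal", "quote"]
p naive_fields('"short line","with an "internal quote", 1 comma and linebreaks"')
# => ["short line", "with an \"internal quote", "1 comma and linebreaks\""]
```

The last call shows the failure: the quote that is followed by a comma is taken as the end of the field, so the value is split in two.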
As TC pointed out in the comments, your text is ambiguous. I'm sure you already know it, but for completeness:
"a" - is that a or "a"? How do you represent a value that you want to be wrapped in quotes?
"1","2" - might be parsed as two values (1 and 2), or as the single value 1","2 - both are legal.
,1 \n 2, - End of line, or newline in the value? You cannot tell, especially if this is supposed to be the last value of its line.
1 \n 2 \n 3 - One value with newlines? Two values (1\n2 and 3, or 1 and 2\n3)? Three values?

You may be able to get some clues if you examine the first value on each row, which, as you have said, should tell you the number of columns and their types. This can give you the additional information you are missing to parse the file (for example, if you know there should be another field in this line, then all newlines belong in the current value). Even then, though, it looks like there are serious problems here...
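The column-count idea can be sketched roughly like this. It is only an illustration under strong assumptions: the helper merge_by_column_count is hypothetical, the expected number of fields is known up front, and fields are unquoted with no commas inside values:

```ruby
# Hypothetical helper: glue physical lines together until a record has the
# expected number of fields. Assumes unquoted fields without embedded commas.
def merge_by_column_count(lines, expected)
  records = []
  buffer = nil
  lines.each do |line|
    # A line that follows an incomplete record is treated as its continuation.
    buffer = buffer.nil? ? line : "#{buffer}\n#{line}"
    fields = buffer.split(',', -1) # -1 keeps trailing empty fields
    next if fields.size < expected # another field is due, keep reading
    records << fields
    buffer = nil
  end
  # Whatever is left over is an incomplete trailing record.
  records << buffer.split(',', -1) unless buffer.nil?
  records
end

p merge_by_column_count(['a,b,c', '1,2', '3,4', 'x,y,z'], 3)
# => [["a", "b", "c"], ["1", "2\n3", "4"], ["x", "y", "z"]]
```

Note that this still cannot tell whether a newline belongs to the last field of a record, which is exactly the ambiguity described above: once the count is reached, the record is closed.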