
Java CSV parser with unescaped quotes [closed]

Tags: java, csv, supercsv

I have a CSV file that has some quoting issues:

"Albanese Confectionery","157137","ALBANESE BULK ASST. MINI WILD FRUIT WORMS 2" 4/5LB",9,90,0,0,0,.53,"21",50137,"3441851137","5 lb",1,4,4,$6.7,$6.7,$26.8

SuperCSV is choking on these fruit worms (pun intended). I know that the 2" should probably be 2"", but it's not. LibreOffice actually parses this correctly (which surprises me). I was thinking of just writing my own little parser, but other rows have commas inside the quoted strings:

"Albanese Confectionery","157230","ALBANESE BULK JET FIGHTERS,ASSORTED 4/5  B",9,90,0,0,0,.53,"21",50230,"3441851230","5 lb",1,4,4,$6.7,$6.7,$26.8

Does anyone know of a Java library that will handle crazy stuff like this? Or should I try all the available ones? Or am I better off hacking this out myself?

Hut8 asked Feb 18 '23 07:02


2 Answers

The right solution is to find the person who generated the data and beat them over the head with a keyboard until they fix the problem on their end.

Once you've exhausted that route, you could try some of the other CSV parsers on the market. I've used OpenCSV with success in the past.

Even if OpenCSV won't solve the problem out of the box, the code is fairly easy to read and available under an Apache license, so it might be possible to modify the algorithm to work with your wonky data, and probably easier than starting from scratch.

JohnnyO answered Feb 28 '23 07:02


Surprising even myself here, but I think I would hack it myself. You only need to read the lines and generate the tokens by splitting on quotes or commas, whichever you prefer. That way you can adjust the logic the way it suits you. It's not very hard. The file seems broken enough that working through existing solutions would probably be more work.

One point, though: if LibreOffice already parses it correctly, couldn't you just save the file from there, generating a more reasonable file? However, if you think LibreOffice might be guessing, just write the tokenizer yourself.
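For what it's worth, a hand-rolled tokenizer along these lines could look roughly like the sketch below. It is only a guess at a workable heuristic, not what LibreOffice or any library actually does: inside a quoted field, a quote is treated as the closing quote only when it is followed by a comma or the end of the line, a doubled quote ("") is treated as an escaped quote, and any other quote is kept as a literal character. The class name LenientCsv and the method parseLine are made up for illustration, and the sketch assumes one record per line (no embedded newlines).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a lenient, single-line CSV tokenizer for files
// that contain unescaped quotes inside quoted fields.
public class LenientCsv {

    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        int n = line.length();

        for (int i = 0; i < n; i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != '"') {
                    // Ordinary character (including commas) inside quotes.
                    field.append(c);
                } else if (i + 1 < n && line.charAt(i + 1) == '"') {
                    // Properly escaped quote (""): keep one, skip the other.
                    field.append('"');
                    i++;
                } else if (i + 1 == n || line.charAt(i + 1) == ',') {
                    // Quote followed by a comma or end of line: closing quote.
                    inQuotes = false;
                } else {
                    // Stray unescaped quote, as in 2" 4/5LB: keep it literally.
                    field.append('"');
                }
            } else if (c == '"' && field.length() == 0) {
                // Quote at the very start of a field opens a quoted field.
                inQuotes = true;
            } else if (c == ',') {
                // Unquoted comma ends the current field.
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field has no trailing comma
        return fields;
    }
}
```

Calling LenientCsv.parseLine(...) on each line read from the file gives back the list of field values with the surrounding quotes stripped. The obvious weak spot is a literal quote that happens to sit directly before a comma inside a field (say, 2",ASSORTED): this heuristic would take it for the closing quote, so it is worth checking that every row comes back with the expected number of fields.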

eis answered Feb 28 '23 05:02