 

What is an easy way to clean an unparsable CSV file?

The CSV file was created correctly, but the name and address fields contain every kind of punctuation there is, so importing it into MySQL produces parsing errors. For example, the name field could look like this: "john ""," doe". I have no control over the data I receive, so I can't stop people from entering garbage. In the example above, if you treat the outermost quotes as the enclosing quotes then the field is valid, but MySQL, Excel, LibreOffice, etc. see the inner quote-comma-quote sequence as the start of a whole new field. Is there a way to fix this problem? Some fields even have a backslash before the closing quote. I'm at a loss, as I have 17 million records to import.

I have both Windows and Linux available, so a solution for either OS is welcome.

cmptrwhz asked Dec 10 '25 09:12



2 Answers

This may not be a usable answer, but someone needs to say it: you shouldn't have to do this. CSV is a file format with an expected data encoding. If someone supplies you with a CSV file, it should be delimited and escaped properly; otherwise it's a corrupted file and you should reject it. Make the supplier re-export the file properly from whatever data store it came from.

If you asked someone to send you a JPG and they sent what had been a proper JPG file with every fifth byte omitted or junk bytes inserted, you wouldn't accept that and say "oh, I'll reconstruct it for you".

prodigitalson answered Dec 12 '25 21:12



You don't say whether you have control over the creation of the CSV file. I am assuming you do; if not, the CSV file is corrupt and cannot be recovered without human intervention or some very clever algorithms to "guess" which delimiters are real and which were entered by users.
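If the file really cannot be regenerated, one way to limit the human intervention is to separate the rows that parse cleanly from the ones that don't, so only the rejects need a manual look. Below is a minimal sketch in Python, not a repair tool. It assumes one record per line (no line breaks inside fields) and that you know the expected number of columns; `parse_line`, `triage`, and `expected_fields` are names made up for this illustration.

```python
import csv

def parse_line(line, expected_fields):
    """Return the parsed fields, or None if the line looks corrupt."""
    try:
        # strict=True makes the csv module raise on malformed quoting
        rows = list(csv.reader([line], strict=True))
    except csv.Error:
        return None
    # A stray quote-comma-quote inside a field shows up as extra columns
    if len(rows) != 1 or len(rows[0]) != expected_fields:
        return None
    return rows[0]

def triage(src, good, bad, expected_fields):
    """Stream lines from src, copying each to good or bad; return the counts."""
    kept = rejected = 0
    for raw in src:
        if parse_line(raw.rstrip("\r\n"), expected_fields) is None:
            bad.write(raw)
            rejected += 1
        else:
            good.write(raw)
            kept += 1
    return kept, rejected
```

`triage` takes open file objects and reads one line at a time, so it should cope with millions of rows without loading the file into memory. A row that parses into the right number of columns can still be wrong, so treat the "good" output as probably clean rather than verified.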

Convert user-entered tabs (assuming there are some) to spaces, and then export the data using tabs as the separator.
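On the export side, the tab approach could look roughly like the sketch below, assuming you can run Python where the data is exported. `sanitize` and `to_tsv_line` are hypothetical helper names. Replacing line breaks as well as tabs is my own addition, on the assumption that each record must stay on one line.

```python
def sanitize(value):
    """Replace tabs and line breaks so they cannot be mistaken for separators."""
    return value.replace("\t", " ").replace("\r", " ").replace("\n", " ")

def to_tsv_line(fields):
    """Join already-separated field values into one tab-delimited record."""
    return "\t".join(sanitize(str(f)) for f in fields) + "\n"
```

Because no tab survives inside a field, quotes and commas in names and addresses no longer matter to the importer, and no quoting is needed at all.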

If the above is not possible, you need to implement an escape sequence to ensure that user-entered data is never treated as a delimiter.
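One possible shape for such an escape sequence is sketched below in Python. It uses a backslash as the escape character, which I believe is similar to the convention MySQL's LOAD DATA uses by default, but check that against your import settings. `escape_field`, `join_record`, and `split_record` are illustrative names, not part of any library.

```python
ESC = "\\"    # escape character (assumed: backslash)
DELIM = ","   # field delimiter

def escape_field(value):
    """Escape the escape character first, then the delimiter and line breaks."""
    return (value.replace(ESC, ESC + ESC)
                 .replace(DELIM, ESC + DELIM)
                 .replace("\n", ESC + "n")
                 .replace("\r", ESC + "r"))

def join_record(fields):
    """Build one single-line record from raw field values."""
    return DELIM.join(escape_field(f) for f in fields)

def split_record(line):
    """Inverse of join_record: split on unescaped delimiters only."""
    decoded = {"n": "\n", "r": "\r"}
    fields, current, i = [], [], 0
    while i < len(line):
        ch = line[i]
        if ch == ESC and i + 1 < len(line):
            nxt = line[i + 1]
            # An escaped character is data, never a delimiter
            current.append(decoded.get(nxt, nxt))
            i += 2
        elif ch == DELIM:
            fields.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append("".join(current))
    return fields
```

With this scheme, quotes need no special handling at all: a value such as john "," doe is written with its comma escaped and reads back unchanged.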

mattnz answered Dec 12 '25 23:12



