I am using PHP to import a CSV file that originates from an Excel spreadsheet. Some of the fields contain line breaks, so when I reopen the CSV in Excel / OpenOffice Calc, it misinterprets where the line breaks should happen.
Also, in my script, fgetcsv breaks lines where it shouldn't as it reads through the file.
I could cleanse the data manually, but a) that would take ages, as it's a 10k-line file, and b) the data is exported from a client's existing piece of software.
Any ideas on how to solve this automatically during the import process? I would have thought delimiting the fields would have sorted it, but it does not.
Per the CSV convention (RFC 4180), fields containing line breaks (CRLF), double quotes, or commas should be enclosed in double quotes.
A CSV file should have the same number of columns in each row. A CSV file stores data in rows, and the values in each row are separated by a separator, also known as a delimiter.
If you are creating the import CSV in Excel, the quotation marks will be inserted automatically by Excel whenever a comma is detected in any cell; saving the CSV in Excel and opening the same file in Notepad reveals the enclosing quotation marks around cells containing commas.
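A minimal sketch of this in PHP: fputcsv() applies the quoting rule above automatically (a field containing a line break gets enclosed in double quotes), and fgetcsv() understands that quoting when reading the file back, so the multi-line field survives the round trip intact. The variable names and the php://temp stream are illustrative choices, not anything from the question.

```php
<?php
// Write a row whose second field contains an embedded line break.
$rows = [
    ['id', 'comment'],
    ['1', "first line\nsecond line"],  // LF inside one field
];

$fh = fopen('php://temp', 'r+');
foreach ($rows as $row) {
    // fputcsv() encloses the multi-line field in double quotes for us.
    fputcsv($fh, $row);
}
rewind($fh);

// fgetcsv() respects the quoting and keeps reading across the physical
// line break, returning the field as a single value.
$parsed = [];
while (($row = fgetcsv($fh)) !== false) {
    $parsed[] = $row;
}
fclose($fh);

echo $parsed[1][1]; // the field still contains its internal line break
```

The key point: the physical line break inside the quoted field is data, not a record separator, and both fputcsv() and fgetcsv() treat it that way.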
Using implode(PHP_EOL, $lines) to rejoin the broken lines will do the trick.
It's an old thread, but I encountered this problem and solved it with a regex, so you can avoid pulling in a library just for that. The code here is in PHP, but it can be adapted to other languages.
$parsedCSV = preg_replace('/(,|\n|^)"(?:([^\n"]*)\n([^\n"]*))*"/', '$1"$2 $3"', $parsedCSV);
This solution assumes the fields containing a line break are enclosed in double quotes, which seems to be a valid assumption, at least from what I have seen so far. It also assumes the opening double quote follows a comma or is placed at the start of a line (including the first line).
Example:
field1,"field2-part1\nfield2-part2",field3
Here the \n is replaced by a space, so the result would be:
field1,"field2-part1 field2-part2",field3
The regex should handle multiple line breaks as well.
This might not be efficient if the content is very large, but it helps in many cases, and the idea can be reused, perhaps optimized by processing smaller chunks (though you would then need to handle the cuts with fixed-size buffers).
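A quick sketch showing the regex applied to the example above; the string literal uses a real line feed inside the quoted second field, and the variable names are just illustrative.

```php
<?php
// The second field contains an embedded line break inside its quotes.
$parsedCSV = "field1,\"field2-part1\nfield2-part2\",field3";

// Collapse the line break inside the quoted field into a space,
// leaving the record structure untouched.
$fixed = preg_replace(
    '/(,|\n|^)"(?:([^\n"]*)\n([^\n"]*))*"/',
    '$1"$2 $3"',
    $parsedCSV
);

echo $fixed; // field1,"field2-part1 field2-part2",field3
```

After this pass the string is a single physical line, so a naive line-by-line reader no longer splits the record in the wrong place.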