Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Importing CSV that has line breaks within the actual fields

I am using PHP to import a CSV file, which originates from an excel spreadsheet. Some of the fields contain line breaks in them, so when I reopen the csv in excel / open office spreadsheet, it misinterprets where the line breaks should happen.

Also in my script, using fgetcsv to go through each line, it is incorrectly line breaking where it shouldn't be.

I could manually cleanse the data but a) that would take ages as its a 10k line file, and b) the data is exported from a clients existing piece of software

Any ideas on how to automatically solve this on the import process? I would have thought delimiting the fields would have sorted it but it does not.

like image 580
Horse Avatar asked Mar 29 '11 10:03

Horse


People also ask

How do you handle a line break in CSV?

Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

Do all lines in a CSV files have the same number of columns?

A CSV file should have the same number of columns in each row. A CSV file stores data in rows and the values in each row is separated with a separator, also known as a delimiter.

How does everyone handle comma within a field in CSV file while importing it?

If you are creating the import CSV in Excel, the quotation marks will be inserted automatically by Excel whenever a comma is detected in any cell - Saving the CSV in Excel and opening the same in Notepad reveals the enclosing quotation marks for cells containing commas.

How do I handle line breaks in a CSV file using PHP?

use implode(PHP_EOL) will do the trick.


1 Answers

It's an old thread but i encountered this problem and i solved it with a regex so you can avoid a library just for that. Here the code is in PHP but it can be adapted to other language.

$parsedCSV = preg_replace('/(,|\n|^)"(?:([^\n"]*)\n([^\n"]*))*"/', '$1"$2 $3"', $parsedCSV);

This solutions supposes the fields containing a linebreak are enclosed by double quotes, which seems to be a valid assumption, at least for what i have seen so far. Also, the double quotes should follow a , or be placed at the start of a new line (or first line).

Example:

field1,"field2-part1\nfield2-part2",field3

Here the \n is replaced by a whitespace so the result would be:

field1,"field2-part1 field2-part2",field3

The regex should handle multiple linebreaks as well.

This might not be efficient if the content is too large, but it can help for many cases and the idea can be reused, maybe optimized by doing this for smaller chunks (but you'd need to handle the cuts with fix-sized buffered).

like image 104
V. Högman Avatar answered Sep 23 '22 06:09

V. Högman