
fgetcsv() drops characters with diacritics (i.e. non-ASCII) - how to fix?

Similar questions:
Some characters in CSV file are not read during PHP fgetcsv() ,
fgetcsv() ignores special characters when they are at the beginning of line

My application has a form where users can upload a CSV file (its 5 internal users have always uploaded a valid file: comma-delimited, quoted, records terminated by LF). The file is then imported into a database using PHP:

$fhandle = fopen($uploaded_file,'r');
while($row = fgetcsv($fhandle, 0, ',', '"', '\\')) {
    print_r($row);
    // further code not relevant as the data is already corrupt at this point
}

For reasons I cannot change, the users are uploading the file encoded in the Windows-1250 charset - a single-byte, 8-bit character encoding.

The problem: some (not all!) characters above 127 ("extended ASCII") are dropped by fgetcsv(). Example data:

"15","Ústav"
"420","Špičák"
"7","Tmaň"

becomes

Array (
  0 => 15
  1 => "stav"
)
Array (
  0 => 420
  1 => "pičák"
)
Array (
  0 => 7
  1 => "Tma"
)

(Note that č and á in the middle of a field are kept, but Ú, Š and ň at the start or end of a field are dropped.)

The documentation for fgetcsv() says that "since 4.3.5 fgetcsv() is now binary safe", but it looks like it isn't. Am I doing something wrong, or is this function broken, so that I should look for a different way to parse CSV?

Piskvor left the building asked Dec 13 '22 19:12


1 Answer

It turns out that I didn't read the documentation well enough: fgetcsv() is only somewhat binary-safe. It is safe for plain ASCII (byte values 0 to 127), but the documentation also says:

Note:

Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function

In other words, fgetcsv() tries to be binary-safe, but it actually isn't, because it also interprets the bytes according to the charset of the current locale. That locale is not configured in php.ini; it is read from the environment ($LANG). So whenever the file's encoding doesn't match the locale, the function will probably mangle the data it reads.
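Since the behaviour depends on the locale, one thing that may be worth trying is switching LC_CTYPE only for the duration of the import. The following is a minimal sketch, not a confirmed fix: with_ctype_locale() is a hypothetical helper name, and which locale (if any) makes fgetcsv() leave Windows-1250 bytes alone depends on the PHP version and the system's C library. "C" or a single-byte Czech locale are candidates I am assuming, not guaranteeing.

```php
<?php
/**
 * Hypothetical helper: run $callback with LC_CTYPE temporarily set to
 * $locale, then restore whatever setting was active before.
 */
function with_ctype_locale(string $locale, callable $callback)
{
    // Passing "0" only queries the current setting, it does not change it.
    $previous = setlocale(LC_CTYPE, '0');

    if (setlocale(LC_CTYPE, $locale) === false) {
        throw new RuntimeException("Locale not available: $locale");
    }

    try {
        return $callback();
    } finally {
        // Restore the original locale even if the callback throws.
        if ($previous !== false) {
            setlocale(LC_CTYPE, $previous);
        }
    }
}
```

The fgetcsv() loop from the question would go inside the callback, for example with_ctype_locale('C', function () use ($uploaded_file) { ... }). Whether the rows then come out intact has to be checked against real data.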

I've sidestepped the issue by reading the lines with fgets() (which works on bytes, not characters) and using a CSV function from a user comment in the docs to parse each line into an array:

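$fhandle = fopen($uploaded_file, 'r');
// compare against false explicitly so that a final line containing only "0" is not skipped
while (($raw_row = fgets($fhandle)) !== false) { // fgets is actually binary safe
    $row = csvstring_to_array($raw_row, ',', '"', "\n");
    // $row is now read correctly
}
fclose($fhandle);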
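The csvstring_to_array() function from the docs comment is not reproduced here. As an illustration of the same idea, here is a minimal sketch of a parser that treats a line strictly as bytes. The name parse_csv_line_bytes() is hypothetical, it is not the function from the docs, and it assumes one record per line (no line breaks inside quoted fields) and quotes escaped by doubling.

```php
<?php
/**
 * Hypothetical helper: split one CSV record into fields, working on bytes
 * only, so no locale or charset is ever consulted.
 */
function parse_csv_line_bytes(string $line, string $delimiter = ',', string $enclosure = '"'): array
{
    $line = rtrim($line, "\r\n");   // drop the record terminator
    $fields = [];
    $field = '';
    $inQuotes = false;
    $length = strlen($line);        // length in bytes

    for ($i = 0; $i < $length; $i++) {
        $byte = $line[$i];
        if ($inQuotes) {
            if ($byte !== $enclosure) {
                $field .= $byte;            // any byte, including > 127
            } elseif ($i + 1 < $length && $line[$i + 1] === $enclosure) {
                $field .= $enclosure;       // doubled quote = literal quote
                $i++;
            } else {
                $inQuotes = false;          // closing quote
            }
        } elseif ($byte === $enclosure) {
            $inQuotes = true;               // opening quote
        } elseif ($byte === $delimiter) {
            $fields[] = $field;             // field finished
            $field = '';
        } else {
            $field .= $byte;                // unquoted content
        }
    }
    $fields[] = $field;                     // last field of the record

    return $fields;
}
```

This works for Windows-1250 because it is a single-byte encoding in which the comma and the double quote keep their ASCII byte values. The resulting fields are still Windows-1250 bytes; if the rest of the application expects UTF-8, each field can be converted in a separate step, for example with iconv('WINDOWS-1250', 'UTF-8', $field).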
Piskvor left the building answered Dec 19 '22 09:12