
iconv gives "Illegal Character" with smart quotes -- how to get rid of them?

I have a MySQL table with 120,000 lines stored in UTF-8 format. There is one field, product name, that contains text with many accents. I need to fill a second field with this same name after converting it to a url-friendly form (ASCII).

Since PHP doesn't directly handle UTF-8, I'm using:

$value = iconv ('UTF-8', 'ISO-8859-1', $value);

to convert the name to ISO-8859-1, followed by a massive strtr statement to replace each accented character with its unaccented equivalent (à becomes a, for example).

However, the original text names were entered with smart quotes, and iconv chokes whenever it comes across one -- I get:

Unknown error type: [8]

iconv() [function.iconv]: Detected an illegal character in input string

To get rid of the smart quotes before using iconv, I have tried using three statements like:

$value = str_replace('’', "'", $value);

(’ is the raw value of a UTF-8 smart single quote)

Because the text file is so long, these str_replace calls cause the script to time out every single time.

  1. What is the fastest way to strip out the smart quotes (or any invalid characters) from a UTF-8 string, prior to running iconv?

  2. Or, is there an easier solution to this whole problem? What is the fastest way to convert a name with many accents, in UTF-8, to a name with no accents, spelled correctly, in ASCII?

Andrew Swift asked May 26 '09 16:05


1 Answer

Glibc (and GNU libiconv) support the //TRANSLIT and //IGNORE suffixes.

Thus, on Linux, this works just fine:

$ echo $'\xe2\x80\x99'
’
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1
iconv: illegal input sequence at position 0
$ echo $'\xe2\x80\x99' | iconv -futf8 -tiso8859-1//translit
'

I'm not sure what iconv is in use by PHP, but the documentation implies that //TRANSLIT and //IGNORE will work there too.
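As a rough sketch of how this might look on the PHP side, assuming PHP is built against glibc or GNU libiconv: the helper names replace_smart_quotes and to_ascii below are hypothetical, not part of PHP. The first maps the UTF-8 byte sequences for smart quotes to plain quotes in a single strtr pass, so it does not depend on the iconv implementation. The second appends //TRANSLIT to the target charset. What //TRANSLIT produces for accented letters varies with the iconv implementation and the active locale, so treat its output as something to verify on your own system.

```php
<?php
// Hypothetical helper: swap UTF-8 smart quotes for ASCII quotes.
// strtr() with an array does all replacements in one pass over the string.
function replace_smart_quotes(string $value): string
{
    static $map = [
        "\xE2\x80\x98" => "'", // U+2018 left single quote
        "\xE2\x80\x99" => "'", // U+2019 right single quote
        "\xE2\x80\x9C" => '"', // U+201C left double quote
        "\xE2\x80\x9D" => '"', // U+201D right double quote
    ];
    return strtr($value, $map);
}

// Hypothetical helper: ask iconv to approximate characters that have no
// ASCII equivalent instead of failing on them.
// Assumption: the underlying iconv supports the //TRANSLIT suffix, and a
// UTF-8 locale may need to be set first, e.g. setlocale(LC_CTYPE, 'en_US.UTF-8').
function to_ascii(string $value): string
{
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', replace_smart_quotes($value));
    // iconv() returns false on failure; fall back to an empty string here.
    return $ascii === false ? '' : $ascii;
}

echo replace_smart_quotes("It\xE2\x80\x99s"), "\n";
echo to_ascii("plain name"), "\n";
```

Handling the quotes with an explicit map keeps that part predictable, while the accents are left to iconv so that no hand-written accent table is needed.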

ephemient answered Oct 12 '22 03:10