I pull data from a feed into a database. The feed provides the data in XML format, but the data includes "illegal" characters. For example:
A GREAT NEIGHBOURHOOD – WITH A
or
large “country style†eat-in
or
Garage 14’x32’, large
or
OR…….ENDLESS POSSIBILITIES!!
My question is, first, how do I identify the encoding of these characters, and second, how do I change the encoding to match the UTF-8 format expected by my database?
EDIT: To be clear, there's no database involved at this point in the process. The data will be inserted into the DB later, but at the moment I'm just reading the data via a PHP script and printing it on screen using var_dump.
EDIT 2: The data is being pulled from a RETS feed using the PHP PHRETS library.
The problem is either that your UTF-8 response is being interpreted with a different encoding somewhere along the way, or that the database is not set up correctly. Here are some examples of where this can happen and how to fix it.
Before Using Curl
header("Content-Type: text/html; charset=utf-8");
MySQL (my.cnf)
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8
When Creating The Database Manually
CREATE DATABASE `your_database_name` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
When Using Frameworks such as Doctrine
$conn = array(
    'driver' => 'pdo_mysql',
    'dbname' => 'test',
    'user' => 'root',
    'password' => '*****',
    'charset' => 'utf8',
    // PDO::MYSQL_ATTR_INIT_COMMAND is the named constant for 1002
    'driverOptions' => array(PDO::MYSQL_ATTR_INIT_COMMAND => 'SET NAMES utf8')
);
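It seems that at some point the XML source or data, which is UTF-8, is treated as ISO-8859-1 (or, judging by characters such as € and “ in your samples, Windows-1252) and then converted to UTF-8 a second time. For example, the three bytes of a UTF-8 en dash (E2 80 93) read as Windows-1252 become the three characters â€“. Depending on how the feed is fetched and processed, this could happen at several points.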
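To make the double-encoding idea concrete, here is a minimal PHP sketch that reproduces the damage and then tries to reverse it. It assumes the wrong intermediate encoding was Windows-1252 and that the mbstring extension is available; fix_double_encoded() is a hypothetical helper name, not something PHRETS provides. Treat it as a diagnostic aid rather than a permanent fix, since the real cure is to stop the second conversion from happening.

```php
<?php
// Sketch only: fix_double_encoded() is a hypothetical helper, not part of PHRETS.
// Assumption: UTF-8 bytes were misread as Windows-1252 and re-encoded as UTF-8.

function fix_double_encoded(string $text): string
{
    // Undo the second encoding step: map each character back to its Windows-1252 byte.
    $candidate = mb_convert_encoding($text, 'Windows-1252', 'UTF-8');

    // Re-apply the faulty conversion; it must reproduce the input exactly,
    // otherwise characters were lost (e.g. replaced by "?") and we must not touch the text.
    $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252');

    if ($candidate !== $text
        && mb_check_encoding($candidate, 'UTF-8')
        && $roundTrip === $text) {
        return $candidate;
    }

    // Text was not double encoded (or not in the way assumed): leave it alone.
    return $text;
}

// Reproduce the problem: a correct UTF-8 en dash (bytes E2 80 93) ...
$correct = "A GREAT NEIGHBOURHOOD \xE2\x80\x93 WITH A";
// ... wrongly declared as Windows-1252 and converted to UTF-8 again.
$garbled = mb_convert_encoding($correct, 'UTF-8', 'Windows-1252');

var_dump(fix_double_encoded($garbled) === $correct);
```

Text that is already correct UTF-8, or plain ASCII, is returned unchanged because it fails one of the three checks. Sequences where a byte was dropped along the way (such as the closing quote in the “country style†sample, which lost its third byte) cannot be restored reliably by this approach.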
The most likely point is the encoding for the database connection. Make sure it is UTF-8.
Another possibility is the content type header you send.
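To tell these cases apart, it can help to look at the raw bytes instead of the rendered characters. The sketch below is one way to do that in PHP, assuming mbstring is installed; describe_bytes() is a hypothetical helper name introduced here for illustration.

```php
<?php
// Sketch only: describe_bytes() is a hypothetical helper for inspecting a string.
// It reports whether the bytes form valid UTF-8 and lists them in hex.

function describe_bytes(string $s): string
{
    $validity = mb_check_encoding($s, 'UTF-8') ? 'valid UTF-8' : 'not valid UTF-8';

    // bin2hex() yields two hex digits per byte; split them into pairs for readability.
    $hex = implode(' ', str_split(bin2hex($s), 2));

    return $validity . ': ' . $hex;
}

// A correctly encoded UTF-8 en dash.
echo describe_bytes("\xE2\x80\x93"), "\n";

// The same dash as a single Windows-1252 byte, which is not valid UTF-8.
echo describe_bytes("\x96"), "\n";
```

If a dash in the feed data shows up as e2 80 93, the data itself is fine and only the output is being declared or displayed with the wrong charset. If it shows up as c3 a2 e2 82 ac e2 80 9c, the data has already been double encoded before it reached your script. A single byte such as 96 means the data is Windows-1252 and needs one conversion to UTF-8.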