
Invalid characters in XML feed data

I have a feed that I pull data into a database from. It provides the data in XML format. However, the data includes "illegal" characters. For example:

A GREAT NEIGHBOURHOOD – WITH A

or

large “country style†eat-in

or

Garage 14’x32’, large

or

 OR…….ENDLESS POSSIBILITIES!! 

My question is twofold: first, how do I identify the encoding of these characters, and second, how do I convert them to the UTF-8 encoding my database expects?

EDIT: To be clear, there's no database involved in this process (at this point in the process, anyway). The data will be inserted into the DB later, but at the moment I'm just reading the data via a PHP script and printing it on screen using var_dump.

EDIT 2: The data is being pulled from a RETS feed using the PHP PHRETS library.

user101289 asked Nov 12 '15 17:11


2 Answers

The problem is either that your UTF-8 response is being interpreted with a different encoding, or that the database is not set up for UTF-8. Here are some examples of where this can happen and how to fix it.

Before Using Curl

header("Content-Type: text/html; charset=utf-8");

Mysql (my.cnf)

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8


[mysqld]
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
character-set-server = utf8

When Creating The Database Manually

CREATE DATABASE `your_database_name` DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

When Using Frameworks such as Doctrine

$conn = array(
    'driver' => 'pdo_mysql',
    'dbname' => 'test',
    'user' => 'root',
    'password' => '*****',
    'charset' => 'utf8',
    // 1002 is the numeric value of PDO::MYSQL_ATTR_INIT_COMMAND
    'driverOptions' => array(1002=>'SET NAMES utf8')
);
oshell answered Nov 16 '22 06:11


It seems that at some point the XML source or data, which is already UTF-8, is treated as ISO-8859-1 (or the closely related Windows-1252) and converted to UTF-8 a second time. Depending on how you generate the feed, this could happen at several points.
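The double-encoding effect described here can be reproduced in a few lines. This is only a sketch under assumptions: it assumes the mbstring extension is available and that the wrong intermediate encoding is Windows-1252, which may not be what actually happens in your feed. The variable names are illustrative.

```php
<?php
// Correct UTF-8 text containing an en dash (U+2013, bytes e2 80 93).
$original = "NEIGHBOURHOOD \u{2013} WITH";

// Simulate the bug: the UTF-8 bytes are misread as Windows-1252
// and encoded to UTF-8 a second time.
$garbled = mb_convert_encoding($original, 'UTF-8', 'Windows-1252');

// Reverse it: converting back to Windows-1252 yields the original UTF-8 bytes.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');

var_dump($garbled);                 // the en dash now appears as three characters
var_dump($repaired === $original);  // bool(true)
```

The reverse step only round-trips cleanly when every byte of the original has a Windows-1252 mapping. Some bytes (0x9D, for example, which occurs in the right double quotation mark) have none, so fixing the encoding at the source is more reliable than repairing it afterwards.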

The most likely point is the encoding for the database connection. Make sure it is UTF-8.
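For a plain PDO connection, one way to make sure the connection uses UTF-8 is to name the charset in the DSN. The following is a minimal sketch, not the asker's actual setup: `buildMysqlDsn` and `connect` are hypothetical helpers, the host, database name, and credentials are placeholders, and `utf8mb4` is my assumption (MySQL's `utf8` is limited to three-byte characters, while `utf8mb4` covers all of UTF-8).

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that names the connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Hypothetical helper: opens the connection. Host and database are placeholders.
function connect(string $user, string $password): PDO
{
    return new PDO(buildMysqlDsn('localhost', 'test'), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION, // throw on errors instead of failing silently
    ]);
}

echo buildMysqlDsn('localhost', 'test'), PHP_EOL;
// prints: mysql:host=localhost;dbname=test;charset=utf8mb4
```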

Another possibility is the content type header you send.
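To narrow down which of these points is responsible, it can help to look at the raw bytes of a suspicious string before anything converts it. The sketch below assumes the mbstring extension is available; `describeBytes` is a hypothetical helper, and the sample strings are constructed for illustration rather than taken from the feed.

```php
<?php
// Hypothetical helper: reports whether a string is valid UTF-8 and shows its bytes.
function describeBytes(string $s): array
{
    return [
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),
        'hex'        => bin2hex($s),
    ];
}

$correct = "\u{2019}";  // right single quote as proper UTF-8
$double  = mb_convert_encoding($correct, 'UTF-8', 'Windows-1252'); // double-encoded
$single  = "\x92";      // the same quote as one Windows-1252 byte

print_r(describeBytes($correct)); // valid UTF-8, hex e28099
print_r(describeBytes($double));  // valid UTF-8, hex c3a2e282ace284a2
print_r(describeBytes($single));  // not valid UTF-8, hex 92
```

Note that double-encoded text is still valid UTF-8, so a validity check alone will not reveal it. The byte dump does: a punctuation mark that should take three bytes takes eight.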

ThW answered Nov 16 '22 07:11