
Adding UTF-8 support to JS/PHP script [duplicate]

I am working on a page that uses JavaScript to send data to a PHP script via AJAX POST. The problem is that if the input is in a language that is not Latin-based, I end up storing gibberish in the MySQL table. The Latin alphabet works fine.

The page itself is capable of rendering UTF-8 characters if they are in the data provided on page load; it's the POST that I struggle with.

اختبار

and save. See the Network POST request in browser's dev tools.

The post is made through the following JS function

function createEmptyStack(stackTitle) {
    return $.ajax({
        type:'POST',
        url:'ajax.php',
        data: {
            "do": 'createEmptyStack',
            newTitle: stackTitle
        },
        dataType: "json"
    });
}

Here's my PHP code.

header('Content-Type: text/html; charset=utf-8');

$newTitle = trim($_POST['newTitle']);

$db->query("
INSERT INTO t1(project_id, label) 
VALUES (".$_SESSION['project_id'].", '".$newTitle."')");

When I check for encoding on the page like this:

mb_detect_encoding($_POST['newTitle'], "auto");

I get result: UTF-8

I also tried the following header:

header("Content-type: application/json; charset=utf-8");

MySQL table collation where the data is supposed to go is set to utf8_general_ci

I have another page with a form where users populate the same table, and it works perfectly fine with ANY language. When I checked why that page is able to insert similar data into the db successfully, I found the following line above the insert query:

mysql_query("SET NAMES utf8");

I've attempted putting the same line above my query, but the data still looks like gibberish. I also tried the following couple of alternatives:

 mysql_query("SET CHARACTER SET utf8 ");

and

mysql_set_charset('utf8', $db);

...but to no avail. I'm stumped. I need help figuring this out.

Environment:

PHP 5.6.40 (cgi-fcgi)

MySQL 5.6.45


UPDATE

I ran more tests.

I used a phrase "this is a test" in Arabic - هذا اختبار

It seems that the ajax.php code works properly. After the db insert it returns UTF-8 encoded values that look like "\u0647\u0630\u0627 \u0627\u062e\u062a\u0628\u0627\u0631", and the encoding is reported as "UTF-8"; however, the inserted data in my db table appears as: Ù‡Ø°Ø§ Ø§Ø®ØªØ¨Ø§Ø±

So why am I not jumping to converting my db table to a different collation? A couple of reasons: it has nearly half a million records, and it actually works properly when I go to another page that does a very similar INSERT.

Turns out my other page is using ASCII encoding when inserting the data, so it's only natural that I try to convert to ASCII in ajax.php. The problem is that I end up with blank data. I am so confused now...

Thanks


FIXED: based on a few clues, I ended up rewriting all functions for this page to use PDO, and it worked!

santa asked Nov 19 '19 05:11


2 Answers

Ø§Ù„Ù…Ø±Ø§ÙƒØ² is Mojibake, or possibly "double encoding", for المراكز. Please do SELECT col, HEX(col) ... to see which of these the stored value looks like:

Mojibake: D8A7D984D985D8B1D8A7D983D8B2
double encoding: C398C2A7C399E2809EC399E280A6C398C2B1C398C2A7C399C692C398C2B2
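To make the two hex signatures above less mysterious, here is a minimal JavaScript sketch (not part of the original answer) that reproduces them. It assumes that MySQL's latin1 behaves like Windows-1252; the helper names decodeCp1252 and utf8Hex are made up for illustration.

```javascript
// Code points that Windows-1252 assigns to bytes 0x80..0x9F. Every other byte
// maps to the code point with the same number. The unassigned slots (0x81,
// 0x8D, 0x8F, 0x90, 0x9D) fall back to the matching C1 control character.
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Interpret raw bytes as Windows-1252 text (assumed equivalent to MySQL latin1).
function decodeCp1252(bytes) {
  return Array.from(bytes, (b) =>
    String.fromCharCode(b >= 0x80 && b <= 0x9f ? CP1252_HIGH[b - 0x80] : b)
  ).join("");
}

// Uppercase hex of a string's UTF-8 bytes, comparable to MySQL's HEX(col).
function utf8Hex(text) {
  return Array.from(new TextEncoder().encode(text), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("").toUpperCase();
}

const original = "المراكز";

// The same UTF-8 bytes, wrongly read as latin1 text.
const misread = decodeCp1252(new TextEncoder().encode(original));

console.log(utf8Hex(original)); // D8A7D984D985D8B1D8A7D983D8B2
console.log(misread);           // Ø§Ù„Ù…Ø±Ø§ÙƒØ²
console.log(utf8Hex(misread));  // C398C2A7C399E2809EC399E280A6C398C2B1C398C2A7C399C692C398C2B2
```

The first hex string is the text encoded to UTF-8 once; the second is what results when those bytes are mistaken for latin1 and encoded to UTF-8 a second time.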

If Mojibake:

  • The bytes to be stored need to be UTF-8-encoded. Fix this.
  • The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
  • The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
  • HTML should start with <meta charset=UTF-8>.

If double-encoding: This is caused by converting from latin1 (or whatever) to utf8, then treating those bytes as if they were latin1 and repeating the conversion.
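In principle, that repeated conversion can be run backwards: turn each character back into the single latin1 byte it came from, then decode the resulting bytes as UTF-8. The JavaScript sketch below illustrates the idea on one string. It is not a repair procedure from this answer, it assumes latin1 means Windows-1252, and cp1252Byte and undoDoubleEncoding are hypothetical helper names.

```javascript
// Code points that Windows-1252 assigns to bytes 0x80..0x9F (unassigned slots
// fall back to the C1 control with the same number).
const CP1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178,
];

// Map one character back to the single Windows-1252 byte it stands for.
function cp1252Byte(ch) {
  const code = ch.codePointAt(0);
  const high = CP1252_HIGH.indexOf(code);
  if (high !== -1) return 0x80 + high;
  if (code <= 0xff) return code;
  throw new RangeError("not a Windows-1252 character: U+" + code.toString(16));
}

// Undo one round of "UTF-8 bytes misread as latin1".
function undoDoubleEncoding(garbled) {
  const bytes = Uint8Array.from(Array.from(garbled, cp1252Byte));
  // fatal: true makes the decoder throw instead of silently inserting U+FFFD
  // when the bytes are not valid UTF-8 (i.e. the text was not double-encoded).
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log(undoDoubleEncoding("Ø§Ù„Ù…Ø±Ø§ÙƒØ²")); // المراكز
```

If the input was never double-encoded, the function throws rather than returning altered text, which is the safer failure mode when experimenting on a copy of real data.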

More discussion:

Trouble with UTF-8 characters; what I see is not what I stored

Do not use the mysql_* interface in PHP; switch to mysqli_* or PDO interfaces. mysql_* was deprecated in PHP 5.5 and removed in PHP 7.0.

Rick James answered Sep 30 '22 03:09


If your database is latin1, it will store unicode characters as multi-byte characters. If it's utf-8 based, it will still store multiple characters but displayed in a more "sensible" manner.

If your ر character is represented as XYZ (3 bytes), then when you retrieve XYZ, the browser will reassemble those bytes into a visible ر.

However, if your database is utf-8, it'll further encode each component, so that you are "reliably" seeing XYZ in the end. Let's say X is denoted as x1,x2, and Y is just y, and Z is z1,z2,z3, so instead of seeing ر, which is stored as XYZ, you now see x1x2yz1z2z3, which is shown as XYZ.

Try converting your database to latin1 to at least confirm my theory. Thanks.

Edit:

There is no need to use a utf8 js library. Make sure your page's character encoding is utf8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

When you POST the data, you can encode it with encodeURIComponent before sending it with an XHR request. I'm not sure whether the jQuery flavor of $.ajax already does the encoding.
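As a rough illustration of what encodeURIComponent does to non-Latin input, the sketch below builds a form-encoded POST body by hand. buildPostBody is a hypothetical helper, not a jQuery function; the field names are taken from the question. As far as I know, $.ajax passes a plain object given as data through $.param, which applies encodeURIComponent to each key and value, so the call in the question should already be sending this encoding.

```javascript
// Build an application/x-www-form-urlencoded body, percent-encoding each key
// and value as UTF-8 bytes.
function buildPostBody(fields) {
  return Object.entries(fields)
    .map(([key, value]) => encodeURIComponent(key) + "=" + encodeURIComponent(value))
    .join("&");
}

const body = buildPostBody({ "do": "createEmptyStack", newTitle: "اختبار" });
console.log(body);
// do=createEmptyStack&newTitle=%D8%A7%D8%AE%D8%AA%D8%A8%D8%A7%D8%B1

// Decoding the value gives the original text back.
console.log(decodeURIComponent("%D8%A7%D8%AE%D8%AA%D8%A8%D8%A7%D8%B1")); // اختبار
```

Each Arabic letter becomes two percent-encoded bytes, which is the UTF-8 form PHP decodes into $_POST.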

Schien answered Sep 30 '22 03:09