Arabic Character Encoding Issue: UTF-8 versus Windows-1256

Quick Background: I inherited a large SQL dump file containing a combination of English and Arabic text, and (I think) it was originally exported using 'latin1'. I changed all occurrences of 'latin1' to 'utf8' prior to importing the file. The Arabic text didn't appear correctly in phpMyAdmin (which I guess is normal), but when I loaded the text into a web page with the following...

<meta http-equiv='Content-Type' content='text/html; charset=windows-1256'/>

...everything looked good and the Arabic text displayed perfectly.

Problem: My client is really really really picky and doesn't want to change his...

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

...to the 'Windows-1256' equivalent. I didn't think this would be a problem, but when I changed the charset value to 'UTF-8', all of the Arabic characters appeared as diamonds with question marks. Shouldn't UTF-8 display Arabic text correctly?

Here are a few notes about my database configuration:

Database charset is 'utf8'
Database connection collation is 'utf8_general_ci'
All databases, tables, and applicable fields have been collated as 'utf8_general_ci'

I've been scouring Stack Overflow and other forums for anything that relates to my issue. I've found similar problems, but none of the solutions seem to work for my specific situation. Hope someone can help!

asked Dec 29 '11 22:12

ThisLanham

3 Answers

If the document looks right when declared as windows-1256 encoded, then it most probably is windows-1256 encoded. So it was apparently not exported using latin1, which would have been impossible since latin1 has no Arabic letters.

If this is just about a single file, then the simplest way is to convert it from windows-1256 encoding to utf-8 encoding, using e.g. Notepad++. (Open the file in it, change the encoding, via File format menu, to Arabic, windows-1256. Then select Convert to UTF-8 in the File format menu and do File → Save.)

Windows-1256 and UTF-8 are completely different encodings, so data gets all messed up if you declare windows-1256 data as UTF-8 or vice versa. Only ASCII characters, such as English letters, have the same representation in both encodings.
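To make the difference concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about PHP and Perl). The sample word is an assumption, not data from the question. It shows that the same Arabic text yields different bytes in the two encodings, that reading windows-1256 bytes as UTF-8 produces the replacement character U+FFFD (the "diamond with a question mark"), and that ASCII text is identical in both.

```python
# Illustrative sample only: the Arabic word "al-arabiyya", written with escapes.
word = "\u0627\u0644\u0639\u0631\u0628\u064a\u0629"

cp1256_bytes = word.encode("cp1256")  # one byte per Arabic letter
utf8_bytes = word.encode("utf-8")     # two bytes per Arabic letter

print(len(cp1256_bytes), len(utf8_bytes))

# Treat the windows-1256 bytes as if they were UTF-8. Invalid sequences
# are replaced with U+FFFD, which browsers draw as a diamond with "?".
misread = cp1256_bytes.decode("utf-8", errors="replace")
print("\ufffd" in misread)

# ASCII characters have the same byte representation in both encodings.
ascii_same = "English".encode("cp1256") == "English".encode("utf-8")
print(ascii_same)
```

This is why changing only the charset declaration in the meta tag cannot work: the bytes themselves have to be converted.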

answered Oct 18 '22 04:10

Jukka K. Korpela

We can't find the error in your code if you don't show us your code, so we're very limited in how we can help you.

You told the browser to interpret the document as being UTF-8 rather than Windows-1256, but did you actually change the encoding used from Windows-1256 to UTF-8?

For example,

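$ cat a.pl
use strict;
use warnings;
use charnames ':full';   # enables the \N{...} character names used below

# Target encoding comes from the command line, e.g. UTF-8 or Windows-1256.
my $enc = $ARGV[0] or die "usage: perl a.pl ENCODING\n";

# Encode everything printed to STDOUT as $enc, so the bytes actually
# match the charset declared in the meta tag.
binmode STDOUT, ":encoding($enc)";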

print <<"__EOI__";
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=$enc">
<title>Foo!</title>
</head>
<body dir="rtl">
\N{ARABIC LETTER ALEF}\N{ARABIC LETTER LAM}\N{ARABIC LETTER AIN}\N{ARABIC LETTER REH}\N{ARABIC LETTER BEH}\N{ARABIC LETTER YEH}\N{ARABIC LETTER TEH MARBUTA}
</body>
</html>
__EOI__

$ perl a.pl UTF-8 > utf8.html

$ perl a.pl Windows-1256 > cp1256.html
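To check which of the two a given file or string actually is, one rough test is a strict UTF-8 decode. The following is a sketch in Python rather than Perl, and the helper name `looks_like_utf8` is made up for this example. Note the caveat that pure-ASCII data passes the test too, since ASCII is valid UTF-8.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False

# Illustrative sample: four Arabic letters, encoded both ways.
sample = "\u0639\u0631\u0628\u064a"
is_utf8_when_utf8 = looks_like_utf8(sample.encode("utf-8"))
is_utf8_when_cp1256 = looks_like_utf8(sample.encode("cp1256"))
print(is_utf8_when_utf8, is_utf8_when_cp1256)
```

If the data coming out of the database fails this kind of check, the charset declaration was changed but the encoding was not.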

answered Oct 18 '22 06:10

ikegami

I think you need to go back to square one. It sounds like you have a database dump in Win-1256 encoding and you want to work with it in UTF-8 from now on. It also sounds like you are using PHP but you have lots of irrelevant tags on your question and are missing the most important one, PHP.

First, you need to convert the text dump into UTF-8, and you should be able to do that with PHP. Chances are that your conversion script will have two steps: first read the Win-1256 bytes and decode them into internal Unicode text strings, then encode the Unicode text strings into UTF-8 bytes for output to a new text file.
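The two steps can be sketched as follows. This is written in Python rather than PHP purely as an illustration; the function name, the file names and the one-line sample dump are all hypothetical. It assumes the whole dump really is windows-1256. The same decode-then-encode idea carries over to PHP, though the function names differ.

```python
import os
import tempfile

def convert_cp1256_to_utf8(src_path, dst_path):
    """Re-encode a windows-1256 text file as UTF-8, line by line."""
    # newline="" leaves the original line endings untouched.
    with open(src_path, "r", encoding="cp1256", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:        # step 1: windows-1256 bytes -> Unicode text
            dst.write(line)     # step 2: Unicode text -> UTF-8 bytes

# Self-contained demo using a made-up one-line dump.
text = "INSERT INTO t VALUES ('\u0639\u0631\u0628\u064a');\n"
workdir = tempfile.mkdtemp()
src_file = os.path.join(workdir, "dump_cp1256.sql")
dst_file = os.path.join(workdir, "dump_utf8.sql")
with open(src_file, "wb") as f:
    f.write(text.encode("cp1256"))

convert_cp1256_to_utf8(src_file, dst_file)

with open(dst_file, "rb") as f:
    converted = f.read()
print(converted.decode("utf-8") == text)
```

Reading line by line keeps memory use flat, which matters for a large dump.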

Once you have done that, redo the database import as you did before, but now you have correctly encoded the input data as UTF-8.

After that it should be as simple as reading the database and rendering a web page with the correct UTF-8 encoding.

P.S. It is actually possible to reencode the data every time you display it, but that does not solve the problem of having a database full of incorrectly encoded data.

answered Oct 18 '22 06:10

Michael Dillon

Tags:

database

php

character-encoding

utf-8