I have been given a Qt project which needs to support the Persian language. The data is sent from a server; using the first line below, I get a QByteArray, and I convert it to a QString using the second line:
QByteArray readData = socket->readAll();
QString DataAsString = QTextCodec::codecForUtfText(readData)->toUnicode(readData);
When the data sent is English, everything is fine, but when it is Persian, instead of
سلام
I get
سÙ\u0084اÙ\u0085
I mentioned the process so people wouldn't suggest methods for making a multi-language app that uses .tr. This is all about text and decoding, not those translation methods. My OS is Windows 8.1 (in case you need to know).
I get this hex Value when the server sends سلام
0008d8b3d984d8a7d985
By the way, the server sends two extra bytes at the beginning for a reason I don't know, so I cut them off using:
DataAsString.remove(0,2);
after the data has been converted to QString, which is why the hex value above has some extra bytes at the beginning.
I was far too curious to wait for a reply and experimented a bit on my own:
I copied the text سلام (in English: "Hello") and pasted it into Notepad++ (which used UTF-8 encoding in my case). Then I switched to View as Hex and got:
The ASCII dump on the right side looks a bit similar to what OP got unexpectedly. This led me to believe that the bytes in readData are encoded in UTF-8. Hence, I took the exposed hex numbers and made a little sample program:
testQPersian.cc:
#include <QtWidgets>
int main(int argc, char **argv)
{
QByteArray readData = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";
QString textLatin1 = QString::fromLatin1(readData);
QString textUtf8 = QString::fromUtf8(readData);
QApplication app(argc, argv);
QWidget qWin;
QGridLayout qGrid;
qGrid.addWidget(new QLabel("Latin-1:"), 0, 0);
qGrid.addWidget(new QLabel(textLatin1), 0, 1);
qGrid.addWidget(new QLabel("UTF-8:"), 1, 0);
qGrid.addWidget(new QLabel(textUtf8), 1, 1);
qWin.setLayout(&qGrid);
qWin.show();
return app.exec();
}
testQPersian.pro:
SOURCES = testQPersian.cc
QT += widgets
Compiled and tested in cygwin on Windows 10:
$ qmake-qt5 testQPersian.pro
$ make
$ ./testQPersian
Again, the output as Latin-1 looks a bit similar to what OP got as well as what Notepad++ exposed.
The output as UTF-8 provides the expected text (as expected because I provided a proper UTF-8 encoding as input).
It may be a bit confusing that the ASCII/Latin-1 outputs vary. There are multiple single-byte character encodings which share ASCII in the lower half (0 ... 127) but give different meanings to the bytes in the upper half (128 ... 255). (Have a look at ISO/IEC 8859 to see what I mean. These were introduced as localizations before Unicode became popular as the final solution to the localization problem.)
The Persian characters surely all have Unicode code points beyond 127. (Unicode shares ASCII for the first 128 code points as well.) Such code points are encoded in UTF-8 as sequences of multiple bytes where each byte has the MSB (the most significant bit, bit 7) set. Hence, if these bytes are (accidentally) interpreted with any ISO 8859 encoding, the upper half becomes relevant. Thus, depending on which ISO 8859 encoding is currently used, this may produce different glyphs.
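To make the difference concrete, here is a minimal sketch in plain C++ (no Qt) that decodes the same eight bytes once as Latin-1 and once as UTF-8. The function names decodeLatin1 and decodeUtf8 are made up for this illustration, and the UTF-8 decoder assumes well-formed input; it is not a replacement for QString::fromUtf8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Latin-1: every byte maps directly to the code point with the same value.
std::vector<std::uint32_t> decodeLatin1(const std::string &bytes)
{
    std::vector<std::uint32_t> out;
    for (unsigned char b : bytes) out.push_back(b);
    return out;
}

// Illustrative UTF-8 decoder (hypothetical helper, assumes well-formed input).
std::vector<std::uint32_t> decodeUtf8(const std::string &bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        for (; extra > 0 && i < bytes.size(); --extra, ++i) {
            const unsigned char cont = static_cast<unsigned char>(bytes[i]);
            assert((cont & 0xC0) == 0x80); // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

Fed with "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", the Latin-1 path yields eight code points (one per byte, all above 127), while the UTF-8 path yields the four code points U+0633, U+0644, U+0627, U+0645, which are the letters of سلام.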
Some continuation:
OP sent the following snapshot:
So, it seems instead of
d8 b3 d9 84 d8 a7 d9 85
he got
00 08 d8 b3 d9 84 d8 a7 d9 85
A possible interpretation:
The server first sends a 16 bit length 00 08 (interpreted as a big-endian 16 bit integer: 8), then 8 bytes encoded in UTF-8 (which look exactly like the ones I got while experimenting above).
(AFAIK, it's not unusual to use big-endian for binary network protocols to prevent endianness issues if sender and receiver natively have different endianness.) Further reading e.g. here: htons(3) - Linux man page
On the i386 the host byte order is Least Significant Byte first, whereas the network byte order, as used on the Internet, is Most Significant Byte first.
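As a small illustration of the byte order point, the following sketch combines two bytes in network order into a 16 bit value using shifts, which gives the same result regardless of the host's endianness. The function name readBigEndian16 is hypothetical and not part of Qt or the socket API.

```cpp
#include <cstdint>

// Combine two bytes received in network (big-endian) order into a 16 bit value.
// Shifting works on values, not on memory layout, so the result does not
// depend on the endianness of the host CPU.
constexpr std::uint16_t readBigEndian16(std::uint8_t high, std::uint8_t low)
{
    return static_cast<std::uint16_t>((high << 8) | low);
}

// The two leading bytes from OP's hex dump decode to a length of 8.
static_assert(readBigEndian16(0x00, 0x08) == 8, "00 08 is 8 in big-endian");
```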
OP says that the protocol in use is Java's DataOutput.writeUTF:
Writes two bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.
So, the decoding could look like this:
QByteArray readData("\x00\x08\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", 10);
//QByteArray readData = socket->readAll();
// Decode the big-endian 16 bit length prefix.
unsigned length
= ((uint8_t)readData[0] << 8) + (uint8_t)readData[1];
// Skip the 2 length bytes and decode the payload as UTF-8.
QString text = QString::fromUtf8(readData.data() + 2, length);
The first two bytes are extracted from readData and combined into the length (decoding a big-endian 16 bit integer).
The rest of readData is converted to QString, providing the previously extracted length. Thereby, the first 2 length bytes of readData are skipped.
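One caveat: a single readAll() on a stream socket is not guaranteed to return a complete message, so it may be worth checking that the buffer really holds the announced number of bytes before decoding. Below is a minimal sketch of such a check in plain C++ (std::string stands in for QByteArray here). The helper name extractPayload is made up for this example, and it assumes exactly the framing described above: a 2 byte big-endian length followed by the payload.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: copies the payload of one length-prefixed message
// into `payload`. Returns false if the buffer does not (yet) contain the
// 2 length bytes plus the announced number of payload bytes.
bool extractPayload(const std::string &buffer, std::string &payload)
{
    if (buffer.size() < 2) return false; // length prefix incomplete
    const std::size_t length =
        (static_cast<std::size_t>(static_cast<std::uint8_t>(buffer[0])) << 8)
        | static_cast<std::uint8_t>(buffer[1]);
    if (buffer.size() < 2 + length) return false; // payload incomplete
    payload = buffer.substr(2, length); // skip the 2 length bytes
    assert(payload.size() == length);
    return true;
}
```

In the Qt code, the same idea would mean checking readData.size() against 2 + length before calling QString::fromUtf8, and keeping the bytes around until more data arrives if the check fails. Also note that writeUTF produces modified UTF-8, which differs from standard UTF-8 only for U+0000 and for characters outside the Basic Multilingual Plane, so Persian text should not be affected.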