
QString in Persian

I have been given a Qt project that needs to support the Persian language. The data is sent from a server. The first line below reads it into a QByteArray, and the second line converts it to a QString:

    QByteArray readData = socket->readAll();
    QString DataAsString = QTextCodec::codecForUtfText(readData)->toUnicode(readData);

When the data sent is in English, everything is fine, but when it is in Persian, instead of

سلام

I get

سÙ\u0084اÙ\u0085

I describe the process so that people don't suggest the usual ways of building a multi-language app with tr(). This is all about text decoding, not about those translation mechanisms. My OS is Windows 8.1 (in case you need to know).

I get this hex value when the server sends سلام:

0008d8b3d984d8a7d985

By the way, the server sends two extra bytes at the beginning for a reason I don't know, so I cut them off using:

    DataAsString.remove(0,2);

after the data has been converted to QString. That is why the hex value above has some extra bytes at the beginning.

Steve Moretz asked Aug 25 '18 14:08


1 Answer

I was far too curious to wait for a reply and toyed with it a bit on my own:

I copied the text سلام (in English: "Hello") and pasted it into Notepad++ (which used UTF-8 encoding in my case). Then I switched to the hex view and got:

snapshot of Notepad++ - hex dump of "سلام"

The ASCII dump on the right side looks a bit similar to what OP unexpectedly got. This led me to believe that the bytes in readData are encoded in UTF-8. Hence, I took the hex numbers shown and wrote a little sample program:

testQPersian.cc:

#include <QtWidgets>

int main(int argc, char **argv)
{
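  // the UTF-8 bytes of "سلام" as shown in the hex dump
  QByteArray readData = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";
  // decode the same bytes twice: once (wrongly) as Latin-1, once as UTF-8
  QString textLatin1 = QString::fromLatin1(readData);
  QString textUtf8 = QString::fromUtf8(readData);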
  QApplication app(argc, argv);
  QWidget qWin;
  QGridLayout qGrid;
  qGrid.addWidget(new QLabel("Latin-1:"), 0, 0);
  qGrid.addWidget(new QLabel(textLatin1), 0, 1);
  qGrid.addWidget(new QLabel("UTF-8:"), 1, 0);
  qGrid.addWidget(new QLabel(textUtf8), 1, 1);
  qWin.setLayout(&qGrid);
  qWin.show();
  return app.exec();
}

testQPersian.pro:

SOURCES = testQPersian.cc

QT += widgets

Compiled and tested in Cygwin on Windows 10:

$ qmake-qt5 testQPersian.pro

$ make

$ ./testQPersian

snapshot of testQPersian

Again, the output as Latin-1 looks a bit similar to what OP got as well as to what Notepad++ showed.

Decoding as UTF-8 yields the expected text (unsurprisingly, because I provided proper UTF-8 as input).
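The guess that the received bytes are UTF-8 can also be checked without a GUI. The following is a minimal, Qt-free sketch of such a check. The function name isWellFormedUtf8 is my own and not part of Qt or of OP's code. It only tells whether the bytes could be UTF-8, so treat it as a heuristic rather than a proof:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a Qt API): returns true if `bytes` is a
// well-formed UTF-8 sequence.
bool isWellFormedUtf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // number of continuation bytes expected
        std::uint32_t cp = 0;      // code point being assembled
        std::uint32_t minCp = 0;   // smallest value allowed for this length
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; minCp = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; minCp = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; minCp = 0x10000;
        } else {
            return false;          // stray continuation byte or invalid lead
        }
        if (bytes.size() - i <= extra) {
            return false;          // sequence is cut off
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;      // continuation bytes must look like 10xxxxxx
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        // reject overlong forms, surrogates and values beyond U+10FFFF
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return false;
        }
        i += extra + 1;
    }
    return true;
}
```

The eight bytes d8 b3 d9 84 d8 a7 d9 85 pass this check, while typical Latin-1 text with a lone byte above 127 (e.g. "caf\xe9") does not.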

Maybe it's a bit confusing that the ASCII/Latin-1 outputs vary. There exist multiple single-byte character encodings which share ASCII in the lower half (0 ... 127) but assign different meanings to the bytes in the upper half (128 ... 255). (Have a look at ISO/IEC 8859 to see what I mean. These were introduced as localizations before Unicode became popular as the general solution to the localization problem.)

The Persian characters all have Unicode code points beyond 127. (Unicode shares ASCII for the first 128 code points as well.) Such code points are encoded in UTF-8 as sequences of multiple bytes where each byte has the MSB (the most significant bit, bit 7) set. Hence, if these bytes are (accidentally) interpreted with any ISO 8859 encoding, the upper half becomes relevant. Thus, depending on the ISO 8859 encoding currently in use, this may produce different glyphs.
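To make this concrete, here is a small Qt-free sketch of what a Latin-1 decoder does with the UTF-8 bytes of سلام. The function names decodeLatin1 and allBytesInUpperHalf are my own. As far as I know, QTextCodec::codecForUtfText falls back to Latin-1 when it finds no byte order mark, which would explain why OP's code ended up on this path:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Latin-1 (ISO 8859-1) maps every byte value to the Unicode code point
// with the same number, so decoding is a plain widening copy.
std::vector<std::uint32_t> decodeLatin1(const std::string& bytes)
{
    std::vector<std::uint32_t> codePoints;
    for (char c : bytes) {
        codePoints.push_back(static_cast<unsigned char>(c));
    }
    return codePoints;
}

// True if every byte has bit 7 set, i.e. lies in the upper half (128 ... 255).
bool allBytesInUpperHalf(const std::string& bytes)
{
    for (char c : bytes) {
        if ((static_cast<unsigned char>(c) & 0x80) == 0) {
            return false;
        }
    }
    return true;
}
```

For the bytes d8 b3 d9 84 d8 a7 d9 85 this yields eight code points instead of four: U+00D8 (Ø), U+00B3 (³), U+00D9 (Ù), U+0084, U+00D8 (Ø), U+00A7 (§), U+00D9 (Ù), U+0085. U+0084 and U+0085 are control characters without a glyph, which matches the \u0084 and \u0085 escapes in OP's output.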


Some continuation:

OP sent the following snapshot:

Snapshot (provided by OP)

So, it seems instead of

d8 b3 d9 84 d8 a7 d9 85

he got

00 08 d8 b3 d9 84 d8 a7 d9 85

A possible interpretation:

The server first sends a 16 bit length 00 08 (interpreted as a big-endian 16 bit integer: 8), then 8 bytes encoded in UTF-8 (which look exactly like the ones I got while playing around above). (AFAIK, it's not unusual to use big-endian for binary network protocols to prevent endianness issues if sender and receiver natively have different endianness.) Further reading e.g. here: htons(3) - Linux man page

On the i386 the host byte order is Least Significant Byte first, whereas the network byte order, as used on the Internet, is Most Significant Byte first.
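The byte order issue can be sidestepped by combining the bytes arithmetically instead of copying them into an integer. A minimal sketch (the helper names readBigEndian16 and writeBigEndian16 are my own) that behaves the same on little-endian and big-endian hosts:

```cpp
#include <cassert>
#include <cstdint>

// Combine two bytes received in network byte order (most significant
// byte first) into a 16 bit value. Shifts work on values, not on memory
// layout, so the host byte order does not matter.
std::uint16_t readBigEndian16(std::uint8_t high, std::uint8_t low)
{
    return static_cast<std::uint16_t>((high << 8) | low);
}

// Split a 16 bit value into network byte order for sending.
void writeBigEndian16(std::uint16_t value, std::uint8_t out[2])
{
    out[0] = static_cast<std::uint8_t>(value >> 8);
    out[1] = static_cast<std::uint8_t>(value & 0xFF);
}
```

With OP's data, readBigEndian16(0x00, 0x08) gives 8, the number of payload bytes that follow.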


OP states that the protocol in use is Java's DataOutput.writeUTF:

Writes two bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.
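To see how the quoted description produces OP's bytes, here is a rough C++ sketch of the sender side. It is not the Java implementation, just my reading of the documented format, and the name encodeLikeWriteUtf is made up. It treats the input as UTF-16 code units, as Java strings are:

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical re-creation of the writeUTF framing: a 2 byte big-endian
// length followed by modified UTF-8 (one, two or three bytes per unit).
std::string encodeLikeWriteUtf(const std::u16string& text)
{
    std::string payload;
    for (char16_t unit : text) {
        if (unit >= 0x0001 && unit <= 0x007F) {
            payload += static_cast<char>(unit);                        // 1 byte
        } else if (unit <= 0x07FF) {                                   // also U+0000
            payload += static_cast<char>(0xC0 | (unit >> 6));          // 2 bytes
            payload += static_cast<char>(0x80 | (unit & 0x3F));
        } else {
            payload += static_cast<char>(0xE0 | (unit >> 12));         // 3 bytes
            payload += static_cast<char>(0x80 | ((unit >> 6) & 0x3F));
            payload += static_cast<char>(0x80 | (unit & 0x3F));
        }
    }
    if (payload.size() > 0xFFFF) {
        throw std::length_error("payload does not fit a 16 bit length");
    }
    std::string frame;
    frame += static_cast<char>(payload.size() >> 8);    // length, high byte
    frame += static_cast<char>(payload.size() & 0xFF);  // length, low byte
    frame += payload;
    return frame;
}
```

For سلام (U+0633 U+0644 U+0627 U+0645) this produces 00 08 d8 b3 d9 84 d8 a7 d9 85, exactly the hex value OP reported. Modified UTF-8 differs from standard UTF-8 only for U+0000 and for characters outside the Basic Multilingual Plane, so for Persian text a standard UTF-8 decoder is sufficient.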

So, the decoding could look like this:

QByteArray readData("\x00\x08\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", 10);
//QByteArray readData = socket->readAll();
// decode the big-endian 16 bit length from the first two bytes
unsigned length
  = ((uint8_t)readData[0] <<  8) + (uint8_t)readData[1];
// skip the two length bytes and decode the payload as UTF-8
QString text = QString::fromUtf8(readData.data() + 2, length);

  1. The first two bytes are extracted from readData and combined into the length (decoding a big-endian 16 bit integer).

  2. The rest of readData is converted to QString using the previously extracted length. Thereby, the first 2 length bytes of readData are skipped.
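One caveat: a TCP socket delivers a byte stream, so a single readAll() may return only part of a message or more than one message. Below is a Qt-free sketch of a more defensive version of the two steps above. The helper name extractPayload and its signature are my own invention. It refuses to decode until the whole frame has arrived:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: tries to take one length-prefixed frame from the
// start of `buffer`. On success, stores the raw UTF-8 bytes in `payload`
// and the number of bytes used in `consumed`. Returns false if the frame
// is not complete yet, leaving both outputs untouched.
bool extractPayload(const std::string& buffer,
                    std::string& payload,
                    std::size_t& consumed)
{
    if (buffer.size() < 2) {
        return false;  // the length prefix itself is incomplete
    }
    const std::size_t length =
        (static_cast<std::size_t>(static_cast<unsigned char>(buffer[0])) << 8)
        | static_cast<unsigned char>(buffer[1]);
    if (buffer.size() - 2 < length) {
        return false;  // wait for more data
    }
    payload = buffer.substr(2, length);
    consumed = 2 + length;
    return true;
}
```

In the Qt code, the bytes read from the socket would be appended to a buffer, the extracted payload passed to QString::fromUtf8, and the consumed bytes removed from the front of the buffer before trying again.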

Scheff's Cat answered Oct 12 '22 18:10