
Qt Turkish characters in regular expressions

Tags: regex, qt, turkish

I want to validate a QLineEdit's text with a regular expression. It should allow the letters a to z and A to Z, the Turkish characters (ğüşöçİĞÜŞÖÇ), and the digits 0 to 9. I searched for this problem and found two suggested solutions, but neither worked for me. One says to include the Turkish characters directly in the regular expression; the other says to use the Unicode escapes of the Turkish characters.

Below are the two regular expressions I tried:

QRegExp exp = QRegExp("^[a-zA-Z0-9ğüşöçİĞÜŞÖÇ]+$");

QRegExp exp = QRegExp("^[a-zA-Z0-9\u00E7\u011F\u0131\u015F\u00F6\u00FC\u00C7\u011E\u0130\u015E\u00D6\u00DC]+$");

Neither of the regular expressions above validates the name 'İSMAİL'. I also tried a text containing only Turkish characters ('ğüşöçİĞÜŞÖÇ'), and it is not validated either. When I remove the 'İ' character from both texts, they validate. I guess the problem is related to the 'İ' character.

How can I solve the problem?

Note: We are using Qt 4.6.3 in our project.

asked Jun 05 '13 07:06 by onurozcelik



1 Answer

I think this is an encoding problem. You rely on the implicit conversion from const char* to QString, which uses QString::fromAscii. Unless a codec has been set, that function treats the bytes as Latin-1. To use a non-Latin-1 encoding here, call QTextCodec::setCodecForCStrings with the encoding your source files are saved in. I'd use UTF-8, so the initialization of the app would look like this:

QTextCodec::setCodecForCStrings(QTextCodec::codecForName("utf-8"));
QRegExp exp = QRegExp("^[a-zA-Z0-9ğüşöçİĞÜŞÖÇ]+$");
qDebug() << exp.exactMatch("İSMAİL"); // <= true
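
To see why the decoding step matters, here is a plain C++ sketch with no Qt dependency. The helper names decodeLatin1, decodeUtf8 and checkCapitalDottedI are made up for this illustration; they are not Qt functions. It shows that the UTF-8 bytes of 'İ' (0xC4 0xB0) turn into two unrelated characters when read byte by byte as Latin-1, but into the single code point U+0130 when read as UTF-8. Assuming the pattern literal is read the first way while the text to validate arrives as real Unicode, the character class would never contain U+0130.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: treat every byte as one Latin-1 character.
std::u32string decodeLatin1(const std::string& bytes) {
    std::u32string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char32_t>(b));
    return out;
}

// Hypothetical helper: minimal UTF-8 decoder for well-formed input only
// (it does not reject malformed or overlong sequences).
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (b < 0x80)      { cp = b;        extra = 0; }
        else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
        else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
        else               { cp = b & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Checks the claim made in the text above for the letter 'İ'.
inline void checkCapitalDottedI() {
    const std::string bytes = "\xC4\xB0";  // UTF-8 encoding of U+0130
    assert(decodeLatin1(bytes).size() == 2);  // two characters: U+00C4, U+00B0
    assert(decodeUtf8(bytes) == std::u32string(1, char32_t(0x0130)));
}
```

The bytes are written as hex escapes so that the sketch does not depend on how the source file itself is saved.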

To check whether this is really your problem, I suggest a more explicit solution. Save your code in UTF-8 and use QString::fromUtf8 to convert your string literals to QString explicitly:

QRegExp exp = QRegExp(QString::fromUtf8("^[a-zA-Z0-9ğüşöçİĞÜŞÖÇ]+$"));
qDebug() << exp.exactMatch(QString::fromUtf8("İSMAİL")); // <= true
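
Once the text is correctly decoded, the rule itself is just a whitelist over code points. The following plain C++ sketch (no Qt; isAllowed, isValidName and kTurkishLetters are hypothetical names) uses exactly the code points from the \u escapes in the question's second pattern. It assumes the input has already been decoded to Unicode code points, and it is meant as an illustration of the rule, not as a replacement for QRegExp.

```cpp
#include <algorithm>
#include <string>

// Code points taken from the \u escapes in the question's second pattern.
const std::u32string kTurkishLetters =
    U"\u00E7\u011F\u0131\u015F\u00F6\u00FC\u00C7\u011E\u0130\u015E\u00D6\u00DC";

// True for a-z, A-Z, 0-9 and the Turkish letters listed above.
bool isAllowed(char32_t c) {
    if (c >= U'a' && c <= U'z') return true;
    if (c >= U'A' && c <= U'Z') return true;
    if (c >= U'0' && c <= U'9') return true;
    return kTurkishLetters.find(c) != std::u32string::npos;
}

// Mirrors ^[...]+$ : the text is non-empty and every code point is allowed.
bool isValidName(const std::u32string& text) {
    return !text.empty() && std::all_of(text.begin(), text.end(), isAllowed);
}
```

With this sketch, the name from the question written as U"\u0130SMA\u0130L" is accepted, while the mis-decoded pair U+00C4 U+00B0 is rejected.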
answered Sep 22 '22 05:09 by Pavel Strakhov