I have a site where people can submit links to sites about iPhone apps. The submitter provides the application name, description, category and URL. The site has been running for years and has never received a constructive submission from a Russian developer, but unfortunately it has been discovered by Russian spammers, who annoy the hell out of me. Even with all my anti-spam measures, such as CAPTCHA boxes, some people insist on sending Russian porn links that have nothing to do with the iPhone.
I would like to completely ban any URL or post that uses Russian characters. For URLs there is not much I can do except check whether the URL contains ".ru". But for descriptions, I would like to detect Russian characters. How do I do that in PHP?
Thanks.
Yes, very simple. It is easy to do with UTF-8 regular expressions (assuming your site uses UTF-8 encoding):
function isRussian($text) {
    return preg_match('/[А-Яа-яЁё]/u', $text);
}
According to the PHP documentation, since version 5.1.0 it has been possible to match specific writing scripts in UTF-8 PCRE regular expressions by using \p{script name}. For Russian, that is the Cyrillic script:
preg_match('/[\p{Cyrillic}]/u', $text);
There is a warning on the page:
Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters.
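As a minimal sketch of how the script property could be wrapped for use in a submission check (the function name containsCyrillic is my own, not something from the PHP documentation, and it assumes the input is valid UTF-8):

```php
<?php
// Hypothetical helper name, for illustration only.
// Returns true if $text contains at least one character from the
// Unicode Cyrillic script. Assumes $text is valid UTF-8.
function containsCyrillic(string $text): bool
{
    // preg_match() returns 1 on a match, 0 on no match and false on
    // error (for example, invalid UTF-8 with the /u modifier), so
    // comparing with 1 treats errors as "no match".
    return preg_match('/\p{Cyrillic}/u', $text) === 1;
}

var_dump(containsCyrillic('Great iPhone app')); // bool(false)
var_dump(containsCyrillic('Привет, мир'));      // bool(true)
```

Note that \p{Cyrillic} matches the whole script, so text in other Cyrillic-script languages (Ukrainian, Bulgarian, Serbian and so on) is flagged as well, while the explicit [А-Яа-яЁё] class only covers the Russian alphabet.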
Now, this code is about 5 years old, and it "worked for me" back when I had a similar problem:
function detect_cyr_utf8($content)
{
    return preg_match('/&#10[78]\d/', mb_encode_numericentity($content, array(0x0, 0x2FFFF, 0, 0xFFFF), 'UTF-8'));
}
So there is no warranty of any kind, but it may help you out (basically, it encodes the characters as numeric entities and then checks for common Cyrillic characters).
Best!
I would download the Russian alphabet and then check the input string with strstr(). For example:
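// Fill in the rest of the alphabet; only two letters are shown here.
$russianChars = array('з', 'я' /* ... etc. */);
foreach ($russianChars as $char) {
    // strstr() returns false when $char does not occur in $input
    if (strstr($input, $char) !== false) {
        // Russian char found in input, do something
    }
}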
A good algorithm would probably act only after finding 3 or so Russian characters, to be sure that the language is actually Russian (Russian characters may show up in other languages, so I suggest doing some research to see whether that's the case).
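A rough sketch of that threshold idea, using a regular expression to count letters instead of looping over an array. The function name looksRussian and the default threshold of 3 are only illustrative assumptions, and the input is assumed to be UTF-8:

```php
<?php
// Hypothetical helper name, for illustration only.
// Flags $text only when it contains at least $threshold letters from
// the Russian alphabet. Assumes $text is valid UTF-8.
function looksRussian(string $text, int $threshold = 3): bool
{
    // preg_match_all() returns the number of full pattern matches,
    // or false on error (for example, invalid UTF-8 with /u).
    $count = preg_match_all('/[А-Яа-яЁё]/u', $text);

    return $count !== false && $count >= $threshold;
}

var_dump(looksRussian('Привет'));        // bool(true): six Russian letters
var_dump(looksRussian('iPhone app да')); // bool(false): only two
```

A fixed count is a blunt instrument, so a real filter might prefer the share of Russian letters in the text, but that is a tuning decision for your own data.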