
Detecting russian characters on a form in PHP

Tags:

php

I have a site where people can submit links to sites about iPhone apps. The submitter provides the application name, description, category and URL. The site has been running for years and has never received a constructive submission from a Russian developer but, unfortunately, it was discovered by Russian spammers, who annoy the hell out of me. Even with all my anti-spam measures, such as CAPTCHA boxes, some guys insist on sending Russian porn that has nothing to do with iPhone.

I would like to completely ban any URL or post that uses Russian characters. For URLs there is not much I can do except check whether the URL contains ".ru". But for descriptions, I would like to detect Russian characters. How do I do that in PHP?

thanks.

Duck asked Jul 09 '10 11:07


4 Answers

Yes, very simple. It is easy to do with UTF-8 regular expressions (assuming your site uses UTF-8 encoding):

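function isRussian($text) {
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match('/[А-Яа-яЁё]/u', $text) === 1;
}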
Alexander Konstantinov answered Nov 20 '22 02:11


According to the PHP documentation, since version 5.1.0 it has been possible to look for specific writing scripts in UTF-8 PCRE regular expressions by using \p{script name}. For Russian, which is written in the Cyrillic script, that is

preg_match( '/[\p{Cyrillic}]/u', $text); 

There is a warning on the page:

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters.
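As a rough illustration of how the script property differs from an explicit Russian-alphabet range, here is a minimal sketch. The helper name hasCyrillic and the sample strings are my own assumptions, not something from the PHP manual. \p{Cyrillic} matches any character of the Cyrillic script, including letters such as the Ukrainian "ї" that are not part of the Russian alphabet.

```php
<?php
// Hypothetical helper: true if $text contains at least one Cyrillic-script character.
// Assumes $text is valid UTF-8; preg_match returns false on invalid input,
// which this sketch treats the same as "no match".
function hasCyrillic(string $text): bool
{
    return preg_match('/\p{Cyrillic}/u', $text) === 1;
}

var_dump(hasCyrillic('Hello'));   // plain Latin text
var_dump(hasCyrillic('Привет'));  // Russian text
var_dump(hasCyrillic('ї'));       // Ukrainian letter, outside the range А-Яа-яЁё

// For comparison: an explicit Russian-alphabet range does not match "ї".
var_dump(preg_match('/[А-Яа-яЁё]/u', 'ї') === 1);
```

Whether the broader match is what you want depends on whether the goal is to block the Cyrillic script as a whole or only the Russian alphabet.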

Julia Clement answered Nov 20 '22 00:11


Now, this code is about 5 years old, and it 'worked for me' back when I had a similar problem:

function detect_cyr_utf8($content)
{
  return preg_match('/&#10[78]\d/', mb_encode_numericentity($content, array(0x0, 0x2FFFF, 0, 0xFFFF), 'UTF-8'));
}

So there is no warranty of any kind, but it may help you out (basically, it encodes every character as a numeric entity, then checks for common Cyrillic characters).

Best!

nathan answered Nov 20 '22 02:11


I would download the Russian alphabet and then check the input string with strstr(). For example:

$russianChars = array('з', 'я' /* ... etc */);

foreach ($russianChars as $char) {
    // strstr() returns false when $char does not occur in $input
    if (strstr($input, $char) !== false) {
        // russian char found in input, do something
    }
}

A good algorithm would probably do something after finding 3 Russian chars or so, to be sure that the language is actually Russian (since Russian chars may show up in other languages, I suggest doing some research if that's the case).
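The "act only after a few characters" idea could be sketched as follows. This is only an illustration under my own assumptions: the function names countCyrillic and looksCyrillic are hypothetical, the threshold of 3 is taken from the suggestion above, and the sketch counts characters with the \p{Cyrillic} property rather than looping over an alphabet array with strstr(). It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper: counts Cyrillic-script characters in a UTF-8 string.
function countCyrillic(string $text): int
{
    // preg_match_all returns the number of full matches, or false on error
    // (for example, when $text is not valid UTF-8).
    $count = preg_match_all('/\p{Cyrillic}/u', $text);
    return $count === false ? 0 : $count;
}

// Hypothetical policy: flag the text once it holds at least $threshold Cyrillic characters.
function looksCyrillic(string $text, int $threshold = 3): bool
{
    return countCyrillic($text) >= $threshold;
}

var_dump(looksCyrillic('Great app, да!'));     // only two Cyrillic characters
var_dump(looksCyrillic('Лучшее приложение'));  // well above the threshold
```

Note that this detects the Cyrillic script, not the Russian language, so text in other Cyrillic-written languages would be flagged as well.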

Luca Matteis answered Nov 20 '22 02:11