
How can I detect Russian spam posts with Perl?

I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and a regex, to detect Russian text so I can block it?

Matthew Lock asked Sep 09 '09 08:09



2 Answers

You can use the following character class to detect Cyrillic characters (used in Russian). Note that Perl writes code points as \x{...}; \u in a Perl regex does not denote a Unicode escape:

[\x{0400}-\x{04FF}]+

If you really just want Russian characters, take a look at the Unicode code chart for the Cyrillic block, which gives the exact range for the basic Russian alphabet: [\x{0410}-\x{044F}]. You'd also need to consider the additional Cyrillic characters that Russian uses, which the chart also lists; for example, Ё (U+0401) and ё (U+0451) fall outside that range.
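As a rough illustration of how such a check might be wired into a forum script, here is a minimal sketch. It uses Perl's \p{Cyrillic} script property instead of a hand-written range, and it assumes the post has already been decoded into a character string. The helper names cyrillic_ratio and looks_russian and the 0.5 threshold are hypothetical choices for this example, not something the answer prescribes.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal UTF-8 text

# Hypothetical helper: fraction of the letters in $text that are Cyrillic.
# $text must be a decoded character string, not raw bytes.
sub cyrillic_ratio {
    my ($text) = @_;
    my $letters = () = $text =~ /\p{L}/g;
    return 0 unless $letters;
    # Count only characters that are both letters and in the Cyrillic script.
    my $cyrillic = () = $text =~ /(?=\p{L})\p{Cyrillic}/g;
    return $cyrillic / $letters;
}

# Hypothetical policy: flag a post when more than half of its letters
# are Cyrillic. The threshold is an assumption to tune for your forum.
sub looks_russian {
    my ($text) = @_;
    return cyrillic_ratio($text) > 0.5 ? 1 : 0;
}

print looks_russian("Купить дешево!") ? "block\n" : "allow\n";
```

Using a ratio rather than a single match is a design choice: it lets a mostly English post that quotes one Russian word through, while still catching posts written mainly in Cyrillic.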

João Silva answered Sep 21 '22 17:09



Using the Unicode Cyrillic range as suggested by JG is fine if everything is encoded as Unicode. However, this is spam, and for the most part it is not. Additionally, spammers very often mix charsets within a single message, which further undermines this approach.

I find that the best way (or at least a preliminary step) to detect Russian spam is to grep for the most commonly used charsets:

koi8-r
windows-1251
iso-8859-5
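A minimal sketch of that grep step follows. The answer only names the charsets; looking for them in a Content-Type style string is an assumption about where the charset label shows up in your input, and the helper names declared_russian_charset and decode_declared are hypothetical. Decoding uses the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Charsets named in the answer above.
my @RUSSIAN_CHARSETS = qw(koi8-r windows-1251 iso-8859-5);

# Hypothetical helper: return the Russian charset declared in a
# Content-Type style string, or undef if none is declared.
sub declared_russian_charset {
    my ($header) = @_;
    for my $cs (@RUSSIAN_CHARSETS) {
        return $cs if $header =~ /charset\s*=\s*["']?\Q$cs\E\b/i;
    }
    return;
}

# Hypothetical helper: decode raw bytes with the declared charset so that
# Unicode-aware checks such as /\p{Cyrillic}/ can be applied afterwards.
sub decode_declared {
    my ($bytes, $charset) = @_;
    return decode($charset, $bytes);
}

my $cs = declared_russian_charset('text/html; charset=windows-1251');
print defined $cs ? "declared: $cs\n" : "no Russian charset declared\n";
```

Content that declares no charset at all, or declares a misleading one, will slip past this check, which is why the answer treats it only as a first pass.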

The next step would be to run a language detection algorithm on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also detects the language) or Xerox. These services provide, IMO, the best language detection around.

mehmet el kasid answered Sep 18 '22 17:09
