I have an English-language forum site written in Perl that is continually bombarded with spam in Russian. Is there a way, using Perl and regex, to detect Russian text so I can block it?
You can use the following to detect characters in the Unicode Cyrillic block (used in Russian). Note that Perl writes code points as \x{...} rather than \u....:
[\x{0400}-\x{04FF}]+
If you really want only Russian characters, take a look at the aforesaid document, which gives the exact range used for the basic Russian alphabet: [\x{0410}-\x{044F}]. Of course, you'd also need to consider the extension Cyrillic characters that are used in Russian, such as Ё (U+0401) and ё (U+0451), which fall outside that range -- these are also mentioned in the document.
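As a rough illustration of the range check above, here is a minimal Perl sketch. It assumes the post arrives as UTF-8 bytes; the helper name looks_cyrillic and the 0.3 threshold are hypothetical choices for this example, not something from the original answer. It uses Perl's \p{Cyrillic} script property instead of a hand-written range, and compares the Cyrillic count against all letters so that a single stray character does not block an otherwise English post.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true when the share of Cyrillic characters among
# all letters exceeds $threshold (default 0.3, an arbitrary assumption).
sub looks_cyrillic {
    my ($bytes, $threshold) = @_;
    $threshold //= 0.3;

    # Assumes the input is UTF-8; malformed bytes are replaced with U+FFFD.
    my $text = decode('UTF-8', $bytes);

    # Count matches by assigning the match list to an empty list in scalar context.
    my $letters  = () = $text =~ /\p{L}/g;
    my $cyrillic = () = $text =~ /\p{Cyrillic}/g;

    return 0 unless $letters;
    return ($cyrillic / $letters) > $threshold ? 1 : 0;
}
```

To restrict the check to the basic Russian alphabet only, the second pattern could be swapped for /[\x{0410}-\x{044F}\x{0401}\x{0451}]/g.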
Using the Unicode Cyrillic range as suggested by JG is fine if everything is encoded as Unicode. However, this is spam, and for the most part it is not. Additionally, spammers very often mix charsets within a single message, which further undermines this approach.
I find that the best way (or at least the preliminary step in the process) of detecting Russian spam is to grep for the most commonly used charsets:
koi8-r
windows-1251
iso-8859-5
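The grep for these charset names could be sketched in Perl roughly as follows. This assumes the raw submission (headers, meta tags, or body) is available as a single string; the function name declares_russian_charset is hypothetical, and the pattern only catches charsets that are explicitly declared by name.

```perl
use strict;
use warnings;

# Case-insensitive match on the three charset names listed above.
my $RUSSIAN_CHARSET = qr/\b(?:koi8-r|windows-1251|iso-8859-5)\b/i;

# Hypothetical helper: true when the raw text names one of those charsets.
sub declares_russian_charset {
    my ($raw) = @_;
    return $raw =~ $RUSSIAN_CHARSET ? 1 : 0;
}
```

For example, declares_russian_charset('Content-Type: text/html; charset=KOI8-R') is true, while a string that only declares charset=utf-8 is not flagged.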
The next step after that would be to run a language detection algorithm on what remains. If it's a big enough problem, use a paid service such as Google Translate (which also detects the language) or Xerox. These services provide, IMO, the best language detection around.