
How do I match a Russian word in Unicode text using Perl?

I want to run a regex against a website, for example http://www.ru.wikipedia.org/wiki/perl (the site is in Russian), and pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything.

How do I do it?


2 Answers

All those answers are overcomplicated. Use this:

$text =~/\p{cyrillic}/

bam.
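
The pattern above only tests whether the text contains a Cyrillic character. To collect whole words, one option is to add a `+` quantifier and match with `/g` in list context. This is a minimal sketch, assuming `$text` has already been decoded into Perl characters; `cyrillic_words` and the sample string are hypothetical and not part of the original answer. Note that `\p{Cyrillic}` covers the whole Cyrillic script, so it also matches words from other Cyrillic-script languages such as Ukrainian or Bulgarian.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');   # encode characters when printing

# Hypothetical helper: return every run of Cyrillic-script characters
# found in an already-decoded string.
sub cyrillic_words {
    my ($text) = @_;
    return $text =~ /(\p{Cyrillic}+)/g;
}

# Illustrative input mixing Latin and Cyrillic text.
my $sample = "Perl (Перл) это язык программирования, created by Larry Wall.";
print "$_\n" for cyrillic_words($sample);
```

Each Cyrillic word from the sample is printed on its own line, and the Latin words and punctuation are skipped.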

Karel Bílek, answered Nov 26 '25 22:11


perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!

Downloading a copy first, this seems to work:

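use strict;
use warnings;
use Encode;

# Slurp the whole input (a file named on the command line, or STDIN) and decode it.
local $/ = undef;
my $text = decode_utf8(<>);

# Collect every run of characters in the Cyrillic block (U+0400 to U+04FF).
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/g);

# Re-encode each word as UTF-8 for output.
foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}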
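
As an alternative to calling decode_utf8 and encode_utf8 by hand, Perl's I/O layers can do the conversion on read and write. The following is only a sketch, assuming the saved page is UTF-8; the helper name `extract_cyrillic` and the command-line usage are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: read UTF-8 text from $in and write each run of
# characters in the Cyrillic block (U+0400..U+04FF) to $out, one per line.
# Returns the number of runs written.
sub extract_cyrillic {
    my ($in, $out) = @_;
    binmode($in,  ':encoding(UTF-8)');   # decode bytes on read
    binmode($out, ':encoding(UTF-8)');   # encode characters on write
    local $/ = undef;                    # slurp mode
    my $text = <$in> // '';
    my $count = 0;
    while ($text =~ /([\x{0400}-\x{04FF}]+)/g) {
        print {$out} "$1\n";
        $count++;
    }
    return $count;
}

# Usage (assumed): perl script.pl saved_page.html
if (@ARGV) {
    open(my $fh, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
    extract_cyrillic($fh, \*STDOUT);
    close($fh);
}
```

With the layers in place, the regex always sees characters rather than bytes, and no per-word encode call is needed when printing.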

Bron Gondwana, answered Nov 26 '25 23:11


