How do I match a Russian word in Unicode text using Perl?

Question

I have a web page I want to run a regex against, say http://www.ru.wikipedia.org/wiki/perl . The page is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything, Latin-script words included.

How do I do it?

Karel Bílek · Accepted Answer

All those answers are overcomplicated. Use this:

$text =~ /\p{Cyrillic}+/

bam.
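
To pull out every word rather than just test for a match, the same property can be used with a capture group and /g in list context. This is only a minimal sketch: the sample string is made up for illustration, and it assumes the text is already a decoded character string (here via use utf8 on a literal) rather than raw bytes.

```perl
use strict;
use warnings;
use utf8;                                  # this source file contains UTF-8 literals
binmode(STDOUT, ':encoding(UTF-8)');       # encode characters on output

# Sample text (an assumption for illustration), already a character string.
my $text = 'Perl это язык программирования, created by Larry Wall.';

# \p{Cyrillic} matches one character of the Cyrillic script;
# the + quantifier and /g collect every run of them as a word.
my @words = $text =~ /(\p{Cyrillic}+)/g;

print "$_\n" for @words;
```

The Latin-script words and the punctuation are skipped, so @words ends up holding only the three Cyrillic words.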

Bron Gondwana · Answer

perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that doesn't help!
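
The 403 is most likely Wikipedia rejecting the default libwww-perl User-Agent string, so one way around it may be to send your own. A rough sketch, not tested against the live site: fetch_page and the agent string are hypothetical names, and it assumes LWP::UserAgent is installed (plus LWP::Protocol::https if the site redirects to HTTPS).

```perl
use strict;
use warnings;

# Hypothetical helper: fetch a page while identifying the client explicitly.
# The agent string is a placeholder; substitute your own tool name and contact.
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;                       # loaded on first use
    my $ua = LWP::UserAgent->new(agent => 'RussianWordExtractor/0.1 (example)');
    my $response = $ua->get($url);
    die 'Fetch failed: ' . $response->status_line . "\n"
        unless $response->is_success;
    # decoded_content returns a character string, decoded per the response charset.
    return $response->decoded_content;
}

# Fetch only when a URL is given on the command line.
if (@ARGV) {
    binmode(STDOUT, ':encoding(UTF-8)');
    print fetch_page($ARGV[0]);
}
```

Because decoded_content already returns characters, the result can be matched directly without calling decode_utf8 on it.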

Downloading a copy first, this seems to work:

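use strict;
use warnings;
use Encode;

# Slurp the whole input (file named on the command line, or STDIN) and decode it.
local $/ = undef;
my $text = decode_utf8(<>);

# Capture every run of characters in the Cyrillic block (U+0400-U+04FF).
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/g);

# Re-encode each word as UTF-8 for output.
foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}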
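
Note that the U+0400-U+04FF range is the whole Cyrillic block, so it also picks up letters used in Ukrainian, Serbian and other languages. If only the modern Russian alphabet is wanted, one possible refinement is to filter the Cyrillic words afterwards. A sketch under that assumption; russian_words is a hypothetical helper name and the sample input is made up.

```perl
use strict;
use warnings;
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: return whole words made only of modern Russian letters.
# U+0410-U+044F covers А through я; Ё (U+0401) and ё (U+0451) sit outside that range.
sub russian_words {
    my ($text) = @_;
    # First take whole Cyrillic-script words, then keep the all-Russian ones.
    return grep { /\A[\x{0410}-\x{044F}\x{0401}\x{0451}]+\z/ }
           $text =~ /(\p{Cyrillic}+)/g;
}

# Sample input (an assumption): Russian "ёж" followed by Ukrainian "їжак".
my $sample = "\x{0451}\x{0436} \x{0457}\x{0436}\x{0430}\x{043A}";
my @found  = russian_words($sample);

print "$_\n" for @found;
```

The second word is dropped because it contains ї (U+0457), which is not a Russian letter. This is still only an alphabet check: a word from another language spelled entirely with letters Russian shares will pass.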
