I have a website I want to run a regular expression over, say http://www.ru.wikipedia.org/wiki/perl . The site is in Russian and I want to pull out all the Russian words. Matching with \w+ doesn't work, and matching with \p{L}+ retrieves everything, not just the Russian words.
How do I do it?
All those answers are overcomplicated. Use this:
my @words = $text =~ /(\p{Cyrillic}+)/g;
Bam. ($text has to be decoded character data, not raw UTF-8 bytes.)
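For what it's worth, here is a minimal self-contained sketch of the same idea. The sample string is made up for illustration and stands in for the downloaded page; the only assumption is that the input arrives as UTF-8 bytes and is decoded before matching.

```perl
use strict;
use warnings;
use utf8;                                   # string literals in this file are UTF-8
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical sample standing in for the downloaded page, as raw UTF-8 bytes.
my $bytes = encode_utf8("Perl это язык программирования, version 5");

# Decode to characters first; on raw bytes the regex would see
# individual bytes rather than Cyrillic letters.
my $text = decode_utf8($bytes);

# Runs of one or more Cyrillic-script characters, collected globally.
my @words = $text =~ /(\p{Cyrillic}+)/g;

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @words;
```

This should print the three Russian words and skip "Perl", "version" and "5", which is the difference from \p{L}+ (any letter in any script).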
perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>
Well, that doesn't help!
With a copy of the page downloaded first, this seems to work (pass the file as an argument or on standard input):
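use strict;
use warnings;
use Encode;

local $/ = undef;              # slurp mode: read the whole input at once
my $text = decode_utf8(<>);    # UTF-8 bytes -> characters
# U+0400..U+04FF is the basic Cyrillic block
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/g);
foreach my $word (@words) {
    print encode_utf8($word) . "\n";    # characters -> UTF-8 bytes for output
}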