Let's say I have a sentence of text:
$body = 'the quick brown fox jumps over the lazy dog';
and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords. I have the following to get single-word keywords:
$words{$_}++ for $body =~ m/(\w+)/g;
After this is complete, I have a hash that looks like the following:
'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1
The next step, so that I can get 2-word keywords, is the following:
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
But that only gets every "other" pair, like this:
'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1
I also need the pairs offset by one word:
'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1,
'lazy dog' => 1
Is there an easier way to do this than the following?
my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
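(For reference, the repetition above can be folded into a single loop over the n-gram size and the starting word offset. This is just a sketch of the same approach, with illustrative variable names:)

# For each n-gram size, strip 0 .. n-1 leading words, then
# collect the remaining non-overlapping n-word runs.
my %words;
for my $n (1 .. 3) {
    for my $offset (0 .. $n - 1) {
        my $copy = $body;
        $copy =~ s/^(?:\w+\W+){$offset}//;      # drop the first $offset words
        my $re = join ' ', ('\w+') x $n;        # e.g. '\w+ \w+ \w+'
        $words{$_}++ for $copy =~ m/($re)/g;
    }
}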
These multi-word keywords are usually called n-grams. An n-gram is a contiguous sequence of n items (words, symbols, or tokens) in a document; for example, "Medium blog" is a 2-gram (a bigram), "Write on Medium" is a 3-gram (a trigram), and "A Medium blog post" is a 4-gram. N-gram models are useful throughout natural language processing wherever word order matters, such as in sentiment analysis, text classification, and text generation. (Character n-grams work the same way, treating the document as a sequence of characters rather than words.)
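As an aside, once the text is split into a list of words, n-grams of any size are just array slices; a minimal sketch (the names here are illustrative):

my @tokens = split ' ', $body;
my $n = 2;                                      # n-gram size
my %ngrams;
$ngrams{ join ' ', @tokens[ $_ .. $_ + $n - 1 ] }++
    for 0 .. $#tokens - $n + 1;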
While the described task might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.
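A minimal sketch of how that might look, based on Text::Ngrams' documented interface (windowsize, type, process_text, to_string; verify the exact options against the module's current documentation):

use Text::Ngrams;

# Word-based bigram analysis; windowsize is the n-gram size.
my $ng = Text::Ngrams->new( windowsize => 2, type => 'word' );
$ng->process_text('the quick brown fox jumps over the lazy dog');
print $ng->to_string( orderby => 'frequency' );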
You can do something a little funky with lookaheads. If I do:
$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;
That expression says to look ahead for two words (and capture them) but consume only one, so each match advances the position by a single word.
I get:
%words: {
'brown fox' => 1,
'fox jumps' => 1,
'jumps over' => 1,
'lazy dog' => 1,
'over the' => 1,
'quick brown' => 1,
'the lazy' => 1,
'the quick' => 1
}
It seems I can generalize this by putting in a variable for the count (note that the pattern matches runs of $n + 1 words, so $n = 4 collects five-word sequences):
my $n = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
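Building on that, a single loop can collect every size at once; a sketch using the same interpolated quantifier:

my %words;
for my $extra (0 .. 2) {    # 1-, 2-, and 3-word keywords
    $words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
}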
I would use a look-ahead to collect everything but the first word; that way, the match position advances correctly on its own:
my $body = 'the quick brown fox jumps over the lazy dog';
my %words;
# 1-grams: a plain global match.
++$words{$1} while $body =~ m/(\w+)/g;
# 2-grams: consume the first word, look ahead at the second.
++$words{"$1 $2"} while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
# 3-grams: consume the first word, look ahead at the next two.
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;
You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could then collect any number of words in $2 instead of using one group per word.
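A sketch of that simplification (assuming words are separated by single spaces): the lookahead captures all the trailing words in one group.

# 3-grams: $1 is the first word, $2 holds the remaining two.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+ \w+))/g;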