
Perl paragraph n-gram

Tags: perl, n-gram

Let's say I have a sentence of text:

$body = 'the quick brown fox jumps over the lazy dog';

and I want to turn that sentence into a hash of 'keywords', but I want to allow multi-word keywords. I have the following to get single-word keywords:

$words{$_}++ for $body =~ m/(\w+)/g;

After this is complete, I have a hash that looks like the following:

'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1

The next step, so that I can get 2-word keywords, is the following:

$words{$_}++ for $body =~ m/(\w+ \w+)/g;

But that only captures every other pair, like this:

'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1

I also need the pairs offset by one word:

'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1

Is there an easier way to do this than the following?

my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
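For comparison, here is a sketch of one way to avoid the repeated substitute-and-rematch dance entirely (not from the question itself): split the text into a word list once, then walk a sliding window over it for each phrase length.

```perl
use strict;
use warnings;

# Sliding-window n-gram counting over a token list, for lengths 1..3.
my $body   = 'the quick brown fox jumps over the lazy dog';
my @tokens = $body =~ m/(\w+)/g;

my %words;
for my $n (1 .. 3) {                                # keyword lengths
    for my $i (0 .. $#tokens - $n + 1) {            # window start positions
        $words{ join ' ', @tokens[ $i .. $i + $n - 1 ] }++;
    }
}

print "$_ => $words{$_}\n" for sort keys %words;
```

This produces the same counts as the manual approach (e.g. 'the' => 2, 'the quick' => 1, 'quick brown' => 1, 'the lazy dog' => 1) in one pass per length.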
Glen Solsberry, asked Aug 18 '10 20:08


3 Answers

While the described task might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.

Grrrr, answered Nov 12 '22 14:11

You can do something a little funky with lookaheads:

If I do:

$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;

That expression looks ahead for two words (and captures them) but consumes only one, so the match position advances a single word at a time.

I get:

%words: {
          'brown fox' => 1,
          'fox jumps' => 1,
          'jumps over' => 1,
          'lazy dog' => 1,
          'over the' => 1,
          'quick brown' => 1,
          'the lazy' => 1,
          'the quick' => 1
        }

It seems I can generalize this by putting in a variable for the count (note that $n counts the words after the first one, so $n = 4 captures 5-word phrases):

my $n    = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
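As a usage sketch of that generalization (not part of the original answer): since {$n} quantifies the words *after* the leading \w+, setting $n = 2 yields trigrams.

```perl
use strict;
use warnings;

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;
my $n = 2;    # two extra words after the first => 3-word phrases
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;

print "$_\n" for sort keys %words;
```

This emits all seven trigrams, from 'the quick brown' through 'the lazy dog'.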
Axeman, answered Nov 12 '22 16:11


I would use look-ahead to collect everything but the first word. That way, the position advances correctly automatically:

my $body = 'the quick brown fox jumps over the lazy dog';

my %words;

++$words{$1}         while $body =~ m/(\w+)/g;
++$words{"$1 $2"}    while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;

You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could collect any number of words in $2, instead of using one group per word.

cjm, answered Nov 12 '22 15:11