Let's say I have a sentence of text:
$body = 'the quick brown fox jumps over the lazy dog';
and I want to get that sentence into a hash of 'keywords', but I want to allow multi-word keywords. I have the following to get single-word keywords:
$words{$_}++ for $body =~ m/(\w+)/g;
After this is complete, I have a hash that looks like the following:
'the' => 2,
'quick' => 1,
'brown' => 1,
'fox' => 1,
'jumps' => 1,
'over' => 1,
'lazy' => 1,
'dog' => 1
The next step, so that I can get 2-word keywords, is the following:
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
But that only gets every "other" pair, like this:
'the quick' => 1,
'brown fox' => 1,
'jumps over' => 1,
'the lazy' => 1
I also need the pairs offset by one word:
'quick brown' => 1,
'fox jumps' => 1,
'over the' => 1,
'lazy dog' => 1
Is there an easier way to do this than the following?
my $orig_body = $body;
# single word keywords
$words{$_}++ for $body =~ m/(\w+)/g;
# double word keywords
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+)/g;
$body = $orig_body;
# triple word keywords
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body =~ s/^(\w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
$body = $orig_body;
$body =~ s/^(\w+ \w+)//;
$words{$_}++ for $body =~ m/(\w+ \w+ \w+)/g;
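(For reference, the repetition above can be folded into a single loop over the n-gram size and the starting word offset. This is just a sketch of the same approach, with illustrative variable names:)

# For each n-gram size, strip 0 .. n-1 leading words, then
# collect the remaining non-overlapping n-word runs.
my %words;
for my $n (1 .. 3) {
    for my $offset (0 .. $n - 1) {
        my $copy = $body;
        $copy =~ s/^(?:\w+\W+){$offset}//;      # drop the first $offset words
        my $re = join ' ', ('\w+') x $n;        # e.g. '\w+ \w+ \w+'
        $words{$_}++ for $copy =~ m/($re)/g;
    }
}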
These multi-word keywords are usually called n-grams. An n-gram is a contiguous sequence of n items (words, symbols, or tokens) in a document; for example, "Medium blog" is a 2-gram (a bigram), "Write on Medium" is a 3-gram (a trigram), and "A Medium blog post" is a 4-gram. N-gram models are useful throughout natural language processing wherever word order matters, such as in sentiment analysis, text classification, and text generation. (Character n-grams work the same way, treating the document as a sequence of characters rather than words.)
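As an aside, once the text is split into a list of words, n-grams of any size are just array slices; a minimal sketch (the names here are illustrative):

my @tokens = split ' ', $body;
my $n = 2;                                      # n-gram size
my %ngrams;
$ngrams{ join ' ', @tokens[ $_ .. $_ + $n - 1 ] }++
    for 0 .. $#tokens - $n + 1;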
While the described task might be interesting to code by hand, wouldn't it be better to use an existing CPAN module that handles n-grams? It looks like Text::Ngrams (as opposed to Text::Ngram) can handle word-based n-gram analysis.
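A minimal sketch of how that might look, based on Text::Ngrams' documented interface (windowsize, type, process_text, to_string; verify the exact options against the module's current documentation):

use Text::Ngrams;

# Word-based bigram analysis; windowsize is the n-gram size.
my $ng = Text::Ngrams->new( windowsize => 2, type => 'word' );
$ng->process_text('the quick brown fox jumps over the lazy dog');
print $ng->to_string( orderby => 'frequency' );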
You can do something a little funky with lookaheads. If I do:
$words{$_}++ for $body =~ m/(?=(\w+ \w+))\w+/g;
That expression says to look ahead for two words (and capture them) but consume only one, so each match advances the position by a single word.
I get:
%words: {
'brown fox' => 1,
'fox jumps' => 1,
'jumps over' => 1,
'lazy dog' => 1,
'over the' => 1,
'quick brown' => 1,
'the lazy' => 1,
'the quick' => 1
}
It seems I can generalize this by putting in a variable for the count (note that the pattern matches runs of $n + 1 words, so $n = 4 collects five-word sequences):
my $n = 4;
$words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$n}))\w+/g;
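Building on that, a single loop can collect every size at once; a sketch using the same interpolated quantifier:

my %words;
for my $extra (0 .. 2) {    # 1-, 2-, and 3-word keywords
    $words{$_}++ for $body =~ m/(?=(\w+(?: \w+){$extra}))\w+/g;
}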
I would use a look-ahead to collect everything but the first word; that way, the match position advances correctly on its own:
my $body = 'the quick brown fox jumps over the lazy dog';
my %words;
# 1-grams: a plain global match.
++$words{$1} while $body =~ m/(\w+)/g;
# 2-grams: consume the first word, look ahead at the second.
++$words{"$1 $2"} while $body =~ m/(\w+) \s+ (?= (\w+) )/gx;
# 3-grams: consume the first word, look ahead at the next two.
++$words{"$1 $2 $3"} while $body =~ m/(\w+) \s+ (?= (\w+) \s+ (\w+) )/gx;
You could simplify it a bit if you want to stick with a single space instead of \s+ (don't forget to remove the /x modifier if you do that), since you could then collect any number of words in $2 instead of using one group per word.
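A sketch of that simplification (assuming words are separated by single spaces): the lookahead captures all the trailing words in one group.

# 3-grams: $1 is the first word, $2 holds the remaining two.
++$words{"$1 $2"} while $body =~ m/(\w+) (?=(\w+ \w+))/g;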