Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using closures as Iterators

I have recently been playing around with Markov chains, trying to generate text from a large corpus just to see what I got back (some of which was quite interesting).

A large part of building the data structure required for text generation is creating n-grams. Given a small sample text: "Today is Thursday March the sixth" an example n-gram where n = 3 would be:

Today is Thursday
is Thursday March 
Thursday March the
March the sixth
# skipped lines that have < 3 words because is isn't enough for a 3-gram

Depending on the size of the text, the list of n-grams generated by my code could be quite large, in some languages there is the concept of a generator that contains a yield statement to make custom iterators, but Perl is unfortunately not one of them.

Instead, in Perl we can use closures over lexical variables to create Iterators, but I am having a little trouble understanding what I am really gaining when using them.

Here is the iterator that I created to create n-grams (assume that n is held in $self->order):

sub _ngrams {
   my ($self, @words) = @_; 

   return sub {
      while(@words) {
         my @ngram = @words[0 .. $self->order]; # get $order + 1 words
         shift @words;                          # drop the first word

         return @ngram;
      }

      return; # nothing left to do
  };
}

Do I really gain anything from this code efficiency-wise? The list of words is still held entirely in memory in @words. Is there an alternative implementation that could reduce my memory footprint?

Here is how the iterator is used to generate the dictionary:

sub seed { 
   my $self = shift; 

   my $ngram_it = $self->_ngrams(split /\s+/, $self->text); 
GRAM:
   while (my @gram = $ngram_it->()) {
      next GRAM unless @gram == scalar grep { $_ } @gram;

      my $val = pop @gram; 
      my $key = join ' ', @gram; 

      if (exists $self->lexicon->{$key}) {
         push @{$self->lexicon->{$key}}, $val;
      }
      else {
         $self->lexicon->{$key} = [$val];
      }
   }
}

Any input would be very helpful.

like image 256
Hunter McMillen Avatar asked Jan 23 '26 01:01

Hunter McMillen


1 Answers

First of all, your iterator implementation has the bad tendency of returning undef items in the last few values. I'd change it to

sub _ngrams {
   my ($self, @words) = @_; 
   my $order = $self->order;

   return sub {
      if (@words > $order) {
         my @ngram = @words[0 .. $order]; # get $order + 1 words
         shift @words;                          # drop the first word

         return @ngram;
      }

      return; # nothing left to do
  };
}

Next, this iterator is a nice abstraction. It is not meant to increase performance in any way, it's only useful to make your main code simpler. Here, your code would be shorter (but not simpler) if you didn't separate out the iteration, and do it all in your main code.

However, iterators can handle interesting things like lazy evaluation or infinite streams. For this to be useful, we'd have to switch over completely to streams:

# contract: an iterator returns a list of things
# or an empty list when depleted

sub _ngrams {
   my ($self, $source) = @_; 
   my $order = $self->order;

   my @ngram = (undef, map { $source->() } 1 .. $order);

   return sub {
      if (my ($next) = $source->()) {
          (undef, @ngram) = (@ngram, $next);  # or instead: shift/push
          return @ngram;
      }
      return;
  };
}

Which would be initialized like

my $text = $self->text;
my $iter = $self->_ngrams(sub {
    return $1 if $text =~ /\G\s*(\S+)/gc;
    return;
});

Is this useful here? No, because you immediately fetch all elements from the iterator. The simplest solution would use no fancy abstractions, and simply be this:

sub seed { 
   my $self = shift; 

   my @words = split /\s+/, $self->text;
   my $order = $self->order;
   while (@words > $order) {
      my @gram = @words[0 .. $order];  # get the next n-gram
      shift @words;

      my $val = pop @gram; 
      push @{$self->lexicon->{join ' ', @gram}}, $val;
   }
}

I'd wager it's also the most (time-)performant variant.

Note: there is no need to test for exists, as Perl hashes autovivify. (Or are you using strange extensions?)

like image 147
amon Avatar answered Jan 24 '26 21:01

amon



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!