
How to work with generators from a file for tokenization rather than materializing a list of strings?

I have 2 files:

hyp.txt

It is a guide to action which ensures that the military always obeys the commands of the party
he read the book because he was interested in world history

ref.txt

It is a guide to action that ensures that the military will forever heed Party commands
he was interested in world history because he read the book

And I have a function that does some calculations to compare the lines of the text, e.g. line 1 of hyp.txt with line 1 of ref.txt.

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
   """
   :type list_of_tokenized_hyp: iter(iter(str))
   :type list_of_tokenized_ref: iter(iter(str))
   """   
   for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
       # do something with the iter(str)
   return score

This function cannot be changed. I can, however, control what I feed to it. Currently I'm feeding the files into the function like this:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = [line.split() for line in hypfin]
    ref = [line.split() for line in reffin]
    scorer(hyp, ref)

But by doing so I have loaded the whole of both files, and all of the split strings, into memory before feeding them into scorer().

Knowing that scorer() processes the files line by line, is there a way to avoid materializing the split strings before feeding them into the function, without changing scorer()?

Is there a way to feed in some sort of generator instead?

I've tried this:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)

but I'm not sure whether the h.split() has been materialized. If it has been materialized, why? If not, why?

If I could change the scorer() function, I could pass the file objects directly and simply add these lines inside the for loop:

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
   for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
       hypline = hypline.split()
       refline = refline.split()
       # do something with the iter(str)
   return score

But this is not possible in my case, since I can't change that function.

alvas asked Jan 03 '16 22:01


2 Answers

Yes, your example defines two generators:

with open('hyp.txt', 'r') as hypfin, open('ref.txt', 'r') as reffin:
    hyp = (h.split() for h in hypfin)
    ref = (r.split() for r in reffin)
    scorer(hyp, ref)

The split, and the corresponding read of the next line, happens once per iteration of the for loop inside scorer(), so only one line from each file is tokenized at a time.
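
To try this without the real files, here is a minimal sketch. The body of scorer() is a hypothetical stand-in (it just counts the tokens shared by each line pair, since the real scoring logic isn't shown in the question), and io.StringIO objects with made-up contents stand in for the open files:

```python
import io

def scorer(list_of_tokenized_hyp, list_of_tokenized_ref):
    """Hypothetical stand-in: count tokens shared by each hyp/ref line pair."""
    score = 0
    for hypline, refline in zip(list_of_tokenized_hyp, list_of_tokenized_ref):
        score += len(set(hypline) & set(refline))
    return score

# io.StringIO objects stand in for open('hyp.txt') and open('ref.txt')
# so the sketch runs without any files on disk (contents are made up).
hypfin = io.StringIO("he read the book\nworld history\n")
reffin = io.StringIO("he read a book\nworld peace\n")

# Generator expressions: no line is read or split at this point.
hyp = (h.split() for h in hypfin)
ref = (r.split() for r in reffin)

# Each iteration of the loop inside scorer() pulls one line from each source.
score = scorer(hyp, ref)
print(score)
```

Two things to keep in mind with real files: call scorer() inside the with block, because the generators read from the open file handles only when iterated, and remember that a generator is single-pass, so it is empty after scorer() returns.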

Daniel answered Sep 22 '22 16:09


Your generator expressions combined with Python 3's zip() (replace with itertools.izip() in Python 2) behave as you require, i.e. they do not read the entire file and build all of the split lists in one go.

You can get some idea as to what is going on by substituting a logging version of str.split():

def my_split(s):
    print('my_split(): {!r}'.format(s))
    return s.split()

>>> hypfin = open('hyp.txt', 'r')
>>> reffin = open('ref.txt', 'r')
>>> hyp = (my_split(h) for h in hypfin)    # N.B. my_split() not called here
>>> hyp
<generator object <genexpr> at 0x7fa89ad16b40>
>>> ref = (my_split(r) for r in reffin)    # N.B. my_split() not called here
>>> ref
<generator object <genexpr> at 0x7fa89ad16bd0>

>>> z = zip(hyp, ref)    # N.B. my_split() not called here
>>> z
<zip object at 0x7fa89ad15cc8>

>>> hypline, refline = next(z)
my_split(): 'It is a guide to action which ensures that the military always obeys the commands of the party\n'
my_split(): 'It is a guide to action that ensures that the military will forever heed Party commands\n'
>>> hypline, refline = next(z)
my_split(): 'he read the book because he was interested in world history\n'
my_split(): 'he was interested in world history because he read the book\n'
>>> hypline, refline = next(z)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

From the output of my_split() you can see that hyp and ref are indeed generators that do not consume input until required. z is a zip object that also does not consume any input until accessed. The for loop is simulated with next() to demonstrate that only one line of input from each file is consumed at each iteration.
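
The same lazy behaviour can be had from a generator function that opens the file itself, which may be convenient if scorer() has to be called outside a with block. This is only a sketch: tokenized_lines is a hypothetical helper name, and a temporary file with made-up contents is used so that it runs on its own:

```python
import os
import tempfile

def tokenized_lines(path):
    """Hypothetical helper: lazily yield each line of `path` as a list of tokens."""
    with open(path, 'r') as fin:
        for line in fin:
            yield line.split()

# Write a small temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('he read the book\nworld history\n')
    path = tmp.name

lines = tokenized_lines(path)   # nothing is opened or read yet
first = next(lines)             # opens the file and reads only the first line
second = next(lines)            # reads and splits the second line
rest = list(lines)              # exhausts the generator, which closes the file
os.remove(path)

print(first, second, rest)
```

With something like this, scorer(tokenized_lines('hyp.txt'), tokenized_lines('ref.txt')) would also work. Note that each file is closed only once its generator is exhausted or garbage-collected, so if zip() stops early because one file is shorter, the other may stay open a little longer.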

mhawke answered Sep 21 '22 16:09