
Using itertools.tee to duplicate a nested iterator (i.e. itertools.groupby)

I'm reading a file (while doing some expensive logic) whose contents I need to iterate over several times in different functions, so I want to read and parse the file only once.

The parsing function parses the file and returns an itertools.groupby object.

def parse_file():
    ...
    return itertools.groupby(lines, key=keyfunc)

I thought about doing the following:

csv_file_content = parse_file()

file_content_1, file_content_2 = itertools.tee(csv_file_content, 2)

foo(file_content_1)
bar(file_content_2)

However, itertools.tee seems to only be able to "duplicate" the external iterator, while the internal (nested) iterator still refers to the original (hence it will be exhausted after iterating over the 1st iterator returned by itertools.tee).

Standalone MCVE:

from itertools import groupby, tee

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groupby_obj = groupby(li, key=lambda x:x['name'])
tee_obj1, tee_obj2 = tee(groupby_obj, 2)

print(id(tee_obj1))
for group, data in tee_obj1:
    print(group)
    print(id(data))
    for i in data:
        print(i)

print('----')

print(id(tee_obj2))
for group, data in tee_obj2:
    print(group)
    print(id(data))
    for i in data:
        print(i)

Outputs

2380054450440
a
2380053623136
{'name': 'a', 'id': 1}
{'name': 'a', 'id': 2}
b
2380030915976
{'name': 'b', 'id': 3}
{'name': 'b', 'id': 4}
c
2380054184344
{'name': 'c', 'id': 5}
{'name': 'c', 'id': 6}
----
2380064387336
a
2380053623136  # same ID as above
b
2380030915976  # same ID as above 
c
2380054184344  # same ID as above

How can we efficiently duplicate a nested iterator?

DeepSpace asked Jan 01 '19 09:01



1 Answer

It seems that groupby_obj (class 'itertools.groupby') can be consumed only once, even when it is passed through itertools.tee. Parallel assignment of the same groupby_obj doesn't work either, because both names then refer to the same iterator:

tee_obj1, tee_obj2 = groupby_obj, groupby_obj

What does work is a deep copy of groupby_obj (this relies on the underlying iterable, here a list, being copyable):

import copy

tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj
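
If the underlying iterable cannot be copied (a generator or an open file, for example), or on newer Python versions (copy support for itertools objects has been deprecated since Python 3.12), one alternative is to consume the groupby object once and store each group as a list. The sketch below is only an illustration under that assumption: materialize_groups, foo and bar are hypothetical names, and the approach trades memory for the ability to iterate the groups as many times as needed.

```python
from itertools import groupby

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]


def materialize_groups(iterable, key):
    """Run groupby once and store every group as a list (hypothetical helper)."""
    # list(group) must be called before groupby advances to the next key,
    # because each inner group iterator shares the underlying iterator.
    return [(k, list(group)) for k, group in groupby(iterable, key=key)]


def foo(groups):
    # Hypothetical consumer: collect the group keys.
    return [k for k, _ in groups]


def bar(groups):
    # Hypothetical consumer: collect the ids in each group.
    return [[item['id'] for item in items] for _, items in groups]


# Parse once, then reuse the same list of (key, items) pairs everywhere.
groups = materialize_groups(li, key=lambda x: x['name'])
keys = foo(groups)
ids = bar(groups)
```

Because groups is a plain list of (key, list) pairs, both foo and bar see the full nested data, and either can be called again without re-reading the source.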
iGian answered Sep 21 '22 17:09