I'm reading and parsing a file (with some expensive logic along the way) that I will need to iterate over several times in different functions, so I really want to read and parse the file only once.
The parsing function parses the file and returns an itertools.groupby object:
def parse_file():
    ...
    return itertools.groupby(lines, key=keyfunc)
I thought about doing the following:
csv_file_content = read_csv_file()
file_content_1, file_content_2 = itertools.tee(csv_file_content, 2)
foo(file_content_1)
bar(file_content_2)
However, itertools.tee seems to only be able to "duplicate" the outer iterator, while the inner (nested) group iterators still refer to the original, so they are exhausted after iterating over the first iterator returned by itertools.tee.
Standalone MCVE:
from itertools import groupby, tee

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groupby_obj = groupby(li, key=lambda x: x['name'])
tee_obj1, tee_obj2 = tee(groupby_obj, 2)

print(id(tee_obj1))
for group, data in tee_obj1:
    print(group)
    print(id(data))
    for i in data:
        print(i)

print('----')

print(id(tee_obj2))
for group, data in tee_obj2:
    print(group)
    print(id(data))
    for i in data:
        print(i)
Output:
2380054450440
a
2380053623136
{'name': 'a', 'id': 1}
{'name': 'a', 'id': 2}
b
2380030915976
{'name': 'b', 'id': 3}
{'name': 'b', 'id': 4}
c
2380054184344
{'name': 'c', 'id': 5}
{'name': 'c', 'id': 6}
----
2380064387336
a
2380053623136 # same ID as above
b
2380030915976 # same ID as above
c
2380054184344 # same ID as above
How can we efficiently duplicate a nested iterator?
tee(iterable, n) splits a single iterable into n independent iterators. It takes two arguments: the iterable to split and an integer n (the number of copies, default 2), and it returns a tuple of n iterators.
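For example, a minimal sketch on a flat iterator, where tee behaves as expected:
from itertools import tee

numbers = iter([1, 2, 3])
it1, it2 = tee(numbers, 2)
print(list(it1))  # [1, 2, 3]
print(list(it2))  # [1, 2, 3] -- the second copy is not exhausted by consuming the first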
groupby(iterable, key) computes a key for each element of the iterable and yields (key, group) pairs, where each group is an iterator over the consecutive elements sharing that key.
So groupby() takes two arguments: (1) the data to group and (2) the function to group it with. Here, lambda x: x['name'] tells groupby() to use the 'name' value of each dict as the grouping key. In the MCVE above, groupby yields three (key, group iterator) pairs, one for each run of equal keys ('a', 'b', 'c').
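A minimal sketch (with assumed sample data, not taken from the question) showing that each group is a lazy view tied to the parent groupby object, so it has to be consumed before advancing to the next key:
from itertools import groupby

data = [{'name': 'a', 'id': 1}, {'name': 'a', 'id': 2}, {'name': 'b', 'id': 3}]
for key, group in groupby(data, key=lambda x: x['name']):
    # list(...) drains the group before groupby moves on to the next key
    print(key, [d['id'] for d in group])
# a [1, 2]
# b [3]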
That being said, the itertools building blocks are implemented in C and are often significantly faster than an equivalent hand-written Python for loop.
It seems that groupby_obj (of class itertools.groupby) can only be consumed once, even when passed through itertools.tee.
Parallel assignment of the same groupby_obj doesn't work either:
tee_obj1, tee_obj2 = groupby_obj, groupby_obj
What does work is a deep copy of the groupby object:
import copy

tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj
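An alternative, sketched here under the assumption that the grouped data fits in memory (this is not from the original answer), is to materialize each group into an ordinary list once; the result can then be iterated any number of times without tee or deepcopy:
from itertools import groupby

li = [{'name': 'a', 'id': 1}, {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3}, {'name': 'b', 'id': 4}]

# Consume the groupby object exactly once, turning every lazy group into a list.
grouped = [(key, list(group)) for key, group in groupby(li, key=lambda x: x['name'])]

for key, data in grouped:   # first pass
    print(key, data)
for key, data in grouped:   # second pass also works; nothing is exhausted
    print(key, data)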