I'm reading and parsing a file (with some expensive logic along the way) that I will need to iterate over several times in different functions, so I really want to read and parse the file only once.
The parsing function parses the file and returns an itertools.groupby object:
def parse_file():
    ...
    return itertools.groupby(lines, key=keyfunc)
I thought about doing the following:
csv_file_content = read_csv_file()
file_content_1, file_content_2 = itertools.tee(csv_file_content, 2)
foo(file_content_1)
bar(file_content_2)
However, itertools.tee seems to only be able to "duplicate" the outer iterator, while the inner (nested) group iterators still refer to the original, so they are exhausted after iterating over the first iterator returned by itertools.tee.
Standalone MCVE:
from itertools import groupby, tee

li = [{'name': 'a', 'id': 1},
      {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3},
      {'name': 'b', 'id': 4},
      {'name': 'c', 'id': 5},
      {'name': 'c', 'id': 6}]

groupby_obj = groupby(li, key=lambda x: x['name'])
tee_obj1, tee_obj2 = tee(groupby_obj, 2)

print(id(tee_obj1))
for group, data in tee_obj1:
    print(group)
    print(id(data))
    for i in data:
        print(i)

print('----')

print(id(tee_obj2))
for group, data in tee_obj2:
    print(group)
    print(id(data))
    for i in data:
        print(i)
Output:
2380054450440
a
2380053623136
{'name': 'a', 'id': 1}
{'name': 'a', 'id': 2}
b
2380030915976
{'name': 'b', 'id': 3}
{'name': 'b', 'id': 4}
c
2380054184344
{'name': 'c', 'id': 5}
{'name': 'c', 'id': 6}
----
2380064387336
a
2380053623136 # same ID as above
b
2380030915976 # same ID as above
c
2380054184344 # same ID as above
How can we efficiently duplicate a nested iterator?
tee(iterable, n) splits a single iterable into n independent iterators. It takes two arguments: the iterable to split and an integer n (the number of copies, default 2), and it returns a tuple of n iterators.
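For example, a minimal sketch on a flat iterator, where tee behaves as expected:
from itertools import tee

numbers = iter([1, 2, 3])
it1, it2 = tee(numbers, 2)
print(list(it1))  # [1, 2, 3]
print(list(it2))  # [1, 2, 3] -- the second copy is not exhausted by consuming the first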
groupby(iterable, key) computes a key for each element of the iterable and yields (key, group) pairs, where each group is an iterator over the consecutive elements sharing that key.
So groupby() takes two arguments: (1) the data to group and (2) the function to group it with. Here, lambda x: x['name'] tells groupby() to use the 'name' value of each dict as the grouping key. In the MCVE above, groupby yields three (key, group iterator) pairs, one for each run of equal keys ('a', 'b', 'c').
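A minimal sketch (with assumed sample data, not taken from the question) showing that each group is a lazy view tied to the parent groupby object, so it has to be consumed before advancing to the next key:
from itertools import groupby

data = [{'name': 'a', 'id': 1}, {'name': 'a', 'id': 2}, {'name': 'b', 'id': 3}]
for key, group in groupby(data, key=lambda x: x['name']):
    # list(...) drains the group before groupby moves on to the next key
    print(key, [d['id'] for d in group])
# a [1, 2]
# b [3]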
That being said, the itertools building blocks are implemented in C and are often significantly faster than an equivalent hand-written Python for loop.
It seems that groupby_obj (of class itertools.groupby) can only be consumed once, even when passed through itertools.tee.
Parallel assignment of the same groupby_obj doesn't work either:
tee_obj1, tee_obj2 = groupby_obj, groupby_obj
What does work is a deep copy of the groupby object:
import copy

tee_obj1, tee_obj2 = copy.deepcopy(groupby_obj), groupby_obj
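An alternative, sketched here under the assumption that the grouped data fits in memory (this is not from the original answer), is to materialize each group into an ordinary list once; the result can then be iterated any number of times without tee or deepcopy:
from itertools import groupby

li = [{'name': 'a', 'id': 1}, {'name': 'a', 'id': 2},
      {'name': 'b', 'id': 3}, {'name': 'b', 'id': 4}]

# Consume the groupby object exactly once, turning every lazy group into a list.
grouped = [(key, list(group)) for key, group in groupby(li, key=lambda x: x['name'])]

for key, data in grouped:   # first pass
    print(key, data)
for key, data in grouped:   # second pass also works; nothing is exhausted
    print(key, data)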