I'm thoroughly puzzled. I have a block of HTML that I scraped out of a larger table. It looks about like this:
<td align="left" class="page">Number:\xc2\xa0<a class="topmenu" href="http://www.example.com/whatever.asp?search=724461">724461</a> Date:\xc2\xa01/1/1999 Amount:\xc2\xa0$2.50 <br/>Person:<br/><a class="topmenu" href="http://www.example.com/whatever.asp?search=LAST&searchfn=FIRST">LAST,\xc2\xa0FIRST </a> </td>
(Actually, it looked worse, but I regexed out a lot of line breaks)
I need to get the lines out, and break up the Date/Amount line. It seemed like the place to start was to find the children of that block of HTML. The block is a string because that's how regex gave it back to me. So I did:
text_soup = BeautifulSoup(text)
text_children = text_soup.find('td').childGenerator()
I've worked out that I can only iterate through text_children once, though I don't understand why that is. It's a listiterator type, which I'm struggling to understand.
I'm used to being able to assume that if I can iterate through something with a for loop I can call on any one element with something like text_children[0]. That doesn't seem to be the case with an iterator. If I create a list with:
my_array = ["one","two","three"]
I can use my_array[1] to see the second item in the array. If I try text_children[1], I get an error:
TypeError: 'listiterator' object is not subscriptable
How do I get at the contents of an iterator?
Before you can step through a collection with an iterator, you must obtain one. In Python, calling the built-in iter() on a collection returns an iterator positioned at its start; a for loop does this for you behind the scenes. With that iterator object you can access each element in the collection, one element at a time.
An iterator in Python is an object used to step through iterable objects such as lists, tuples, dicts, and sets. You create one with the built-in iter() function and advance it with next(), which calls the iterator's __next__() method (named next() in Python 2). Each call returns the next value, and StopIteration is raised when none remain.
An iterator is an object that, like a pointer, refers to a position inside a container, and we can use it to move through the container's contents. Unlike a pointer, a Python iterator only moves forward and hands over one element per step; it cannot be indexed or rewound.
An object is called iterable if we can get an iterator from it. Most built-in containers in Python, such as list, tuple, and str, are iterables. The iter() function (which in turn calls the __iter__() method) returns an iterator from them.
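To make the iterable/iterator distinction concrete, here is a minimal sketch in Python 3. The list contents are made-up sample data, not anything taken from the question's HTML.

```python
# Hypothetical sample data standing in for any iterable container.
items = ["one", "two", "three"]

it = iter(items)      # calls items.__iter__() and returns an iterator
first = next(it)      # the iterator hands back one element per call
second = next(it)

# An iterator is its own iterator: iter(it) returns the very same object.
same_object = iter(it) is it

# A list is iterable but is not itself an iterator, so every iter() call
# on it returns a fresh iterator that starts again from the beginning.
fresh_first = next(iter(items))

print(first, second, same_object, fresh_first)
```

This is why a list can be looped over as many times as you like, while the iterator obtained from it can only be walked once.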
You can easily construct a list from the iterator:
my_list = list(your_generator)
Now you can subscript the elements:
print(my_list[1])
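A slightly fuller sketch of the same idea, which also shows why the iterator in the question could only be walked once. The strings below are a hypothetical stand-in for whatever childGenerator() yields; in Python 3 the type is reported as list_iterator rather than listiterator.

```python
# Hypothetical stand-in for the iterator returned by childGenerator().
children = iter(["Number:", "Date:", "Amount:"])

child_list = list(children)   # consumes the iterator and stores every item
second = child_list[1]        # a list supports indexing

# The iterator is now exhausted, so a second pass yields nothing.
leftover = list(children)

print(second)
print(leftover)
```

Keeping the list around, rather than the iterator, lets you index and loop as often as needed.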
Another way to get a value is by using next. This will pull the next value from the iterator, but as you've already discovered, once you pull a value out of the iterator, you can't always put it back in (whether or not you can depends entirely on the object being iterated over and what its next method actually looks like).
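As a rough illustration of next in Python 3, the sketch below pulls values one at a time from a small made-up iterator and shows the two ways running out of values can surface: a default passed as the second argument, or a StopIteration exception.

```python
numbers = iter([10, 20])      # hypothetical sample data

a = next(numbers)             # first value
b = next(numbers)             # second value
c = next(numbers, None)       # exhausted: the default is returned instead

# Without a default, an exhausted iterator raises StopIteration.
try:
    next(numbers)
    exhausted = False
except StopIteration:
    exhausted = True

print(a, b, c, exhausted)
```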
The reason for this is that often you just want an object that you can iterate over. Iterators are great for that because they produce the elements one at a time rather than needing to store all of the values. In other words, only one element from the iterator occupies memory at a time, whereas with a list or a tuple all of the elements are typically stored in memory before you start iterating.
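If you only need one position and would rather not build the whole list, itertools.islice can skip ahead lazily. This is a sketch under assumptions: squares and nth are hypothetical helper names invented for the illustration, and reaching an item still consumes everything before it.

```python
from itertools import islice

def squares():
    """Hypothetical generator that produces squares lazily, without end."""
    n = 0
    while True:
        yield n * n
        n += 1

def nth(iterable, index, default=None):
    """Return the item at position `index`, consuming the items before it."""
    return next(islice(iterable, index, None), default)

third_square = nth(squares(), 2)          # skips 0 and 1
first_five = list(islice(squares(), 5))   # only five values are ever computed

print(third_square)
print(first_five)
```

Because the generator never ends, list(squares()) would never return; islice bounds how much of it is actually evaluated.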