I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of "Movie Title (Year)". My HTML source looks like:
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
I'd like all the li tags following the first h3 tag and stopping at the next h2 tag, including all nested li tags.
firstH3 = soup.find('h3')
...correctly finds the place I'd like to start.
firstH3 = soup.find('h3')  # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)
...gives me a list uls, each with li contents that I need.
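For anyone who wants to try the sibling walk without fetching the page, here is a minimal, self-contained sketch. The markup is a made-up stand-in for the real Wikipedia source, and the code uses the snake_case spelling find_next_siblings, which is the bs4 name for findNextSiblings; the built-in "html.parser" backend is an assumption chosen so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the structure in the question (not the real page).
html = """
<h3>Header3 (Start here)</h3>
<ul><li>A</li><li>B</li></ul>
<h3>Header 3</h3>
<ul><li>C</li><li>D series:<ul><li>D1</li><li>D2</li></ul></li><li>E</li></ul>
<h2>Header 2 (end here)</h2>
<ul><li>Not wanted</li></ul>
"""

soup = BeautifulSoup(html, "html.parser")
first_h3 = soup.find("h3")  # start here

uls = []
# Called without arguments, find_next_siblings() yields only tags, not text nodes.
for sibling in first_h3.find_next_siblings():
    if sibling.name == "h2":   # stop at the next h2
        break
    if sibling.name == "ul":   # keep only top-level lists
        uls.append(sibling)

print(len(uls))
```

The nested ul is not collected on its own here because it is a child of an li, not a sibling of the h3; it is still reachable through the outer ul.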
Excerpt of the uls list:
<ul>
...
<li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
<li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
<li>Air Bud series:
    <ul>
    <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
    <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
    <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
    <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
    </ul>
</li>
<li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>
But I'm unsure of where to go from here.
Update:
Final Code:
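lis = []
for ul in uls:
    for li in ul.findAll('li'):
        if li.find('ul'):
            continue  # skip parent items; their nested li tags are visited on their own
        lis.append(li)

for li in lis:
    print(li.text)

The if ... continue throws out the LIs that contain ULs, since findAll already returns their nested LIs and the text would otherwise be duplicated. (A break at that point would stop scanning the rest of the list after the first nested UL, so continue is used instead.)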
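The de-duplication idea can be checked in isolation with a small sketch. The markup below is a trimmed, hypothetical version of the excerpt above (links removed), and the whitespace normalisation with split/join is my own addition rather than part of the original code. Note that it skips a parent item with continue, so the items that follow a nested list are still collected.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the excerpt (anchor tags removed for brevity).
html = """
<ul>
  <li><i>Agent Cody Banks</i> (2003)</li>
  <li>Air Bud series:
    <ul>
      <li><i>Air Bud: World Pup</i> (2000)</li>
      <li><i>Air Buddies</i> (2006)</li>
    </ul>
  </li>
  <li><i>Akeelah and the Bee</i> (2006)</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

titles = []
for li in soup.find_all("li"):   # find_all is recursive, so nested li tags are included
    if li.find("ul"):            # a parent item: its children are visited separately
        continue
    # Collapse runs of whitespace so each entry reads "Title (Year)".
    titles.append(" ".join(li.get_text().split()))

for title in titles:
    print(title)
```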
Print output is now:
- 102 Dalmatians(2000)
- 10th & Wolf(2006)
- 11:14(2006)
- 12:08 East of Bucharest(2006)
- 13 Going on 30(2004)
- 1408(2007)
- ...
find() method: find() returns the first tag that matches the given name or attributes (such as id), as a bs4.element.Tag object, or None if nothing matches.
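A quick way to see what find() hands back is to run it on a tiny document. This is only an illustrative sketch: the two-paragraph markup and the id values are invented for the example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Invented two-paragraph document, just to exercise find().
soup = BeautifulSoup("<p id='a'>one</p><p id='b'>two</p>", "html.parser")

first_p = soup.find("p")    # first tag named p
by_id = soup.find(id="b")   # first tag whose id attribute is "b"
missing = soup.find("h3")   # no such tag in this document

print(isinstance(first_p, Tag), first_p.get_text())
print(by_id.get_text())
print(missing)
```

Because a missing tag comes back as None, it is worth checking the result before calling methods such as find_next_siblings() on it.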
.findAll() works for nested li elements:
for ul in uls:
    for li in ul.findAll('li'):
        print(li)
Output:
<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
A list comprehension could work, too.
lis = [li for ul in uls for li in ul.findAll('li')]