
BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of entries in the form "Movie Title (Year)". My HTML source looks like:

<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li>
    </ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>

I'd like all the li tags following the first h3 tag and stopping at the next h2 tag, including all nested li tags.

firstH3 = soup.find('h3') 

...correctly finds the place I'd like to start.

firstH3 = soup.find('h3')  # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)

...gives me a list, uls, in which each ul holds the li items that I need.
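For anyone who wants to run the sibling walk on its own, here is a minimal sketch. It assumes the bs4 package (BeautifulSoup 4) with the built-in "html.parser" backend, uses the newer find_next_siblings() spelling of findNextSiblings(), and works on a made-up HTML string modelled on the sample above rather than the real Wikipedia page. The helper name collect_uls is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup modelled on the question's sample, plus one list after the
# <h2> that should NOT be collected.
HTML = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li>
    </ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul>
    <li>Should not be collected</li>
</ul>
"""


def collect_uls(soup):
    """Return the <ul> siblings that follow the first <h3>, stopping at the next <h2>."""
    first_h3 = soup.find("h3")
    uls = []
    for sibling in first_h3.find_next_siblings():
        if sibling.name == "h2":
            break  # end of the section we care about
        if sibling.name == "ul":
            uls.append(sibling)
    return uls


soup = BeautifulSoup(HTML, "html.parser")
uls = collect_uls(soup)
print(len(uls))
```

Only siblings at the same level as the first h3 are visited, so the nested ul is not appended separately; it stays inside its parent ul.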

Excerpt of the uls list:

<ul>
...
    <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
    <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
            <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
            <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>

But I'm unsure of where to go from here.


Update:

Final Code:

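lis = []
for ul in uls:
    for li in ul.findAll('li'):
        if li.find('ul'):
            continue  # skip parent items; their nested LIs are collected on their own
        lis.append(li)

for li in lis:
    print(li.text)  # Python 3; on Python 2 use: print li.text.encode("utf-8")

The if...continue skips the LIs that contain ULs. findAll() already returns the nested LIs separately, so keeping the parent LI as well would print their text twice. Note that a break in that position would also throw away every LI that comes after the first nested list in the same UL.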
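The filtering step can be tried in isolation with the sketch below. It assumes bs4 with the "html.parser" backend and a shortened, made-up version of the excerpt above. The helper name leaf_items is hypothetical. It skips any li that contains a nested ul instead of ending the loop there, so entries that follow a series heading are still collected.

```python
from bs4 import BeautifulSoup

# Shortened, made-up version of the excerpt: one plain entry, one "series"
# entry with a nested list, and one plain entry after the series.
HTML = """
<ul>
    <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
"""


def leaf_items(ul):
    """Return the text of every <li> under ul that has no nested <ul> of its own."""
    return [
        li.get_text().strip()
        for li in ul.find_all("li")   # recursive: includes nested <li>s
        if li.find("ul") is None      # drop parents such as "Air Bud series:"
    ]


soup = BeautifulSoup(HTML, "html.parser")
titles = leaf_items(soup.find("ul"))
for title in titles:
    print(title)
```

Here the "Air Bud series:" parent is dropped, its two films are kept, and "Akeelah and the Bee (2006)" after the series is kept as well.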

Print output is now:

  • 102 Dalmatians(2000)
  • 10th & Wolf(2006)
  • 11:14(2006)
  • 12:08 East of Bucharest(2006)
  • 13 Going on 30(2004)
  • 1408(2007)
  • ...
danneu asked Dec 06 '10 03:12



2 Answers

.findAll() searches recursively, so it also returns nested li elements:

for ul in uls:
    for li in ul.findAll('li'):
        print(li)

Output:

<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
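To see why the nested items appear, it may help to compare the default recursive search with recursive=False, which only looks at direct children. This is a small sketch assuming bs4 (where findAll() is also available as find_all()) and the "html.parser" backend. The HTML string is a made-up fragment based on the second list in the question.

```python
from bs4 import BeautifulSoup

# Made-up fragment: two top-level items and one nested list with two items.
HTML = """
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li>
    </ul>
    <li>List items</li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("ul")

# Default: descend into every level below the outer <ul>.
all_items = [li.get_text() for li in outer.find_all("li")]

# recursive=False: only <li> tags that are direct children of the outer <ul>.
direct_items = [li.get_text() for li in outer.find_all("li", recursive=False)]

print(all_items)
print(direct_items)
```

The recursive call yields four items in document order, while the non-recursive call yields only the two top-level ones.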
jfs answered Oct 03 '22 02:10


A list comprehension could work, too.

lis = [li for ul in uls for li in ul.findAll('li')] 
zachwill answered Oct 03 '22 01:10