BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

I'm a newbie programmer trying to jump into Python by building a script that scrapes http://en.wikipedia.org/wiki/2000s_in_film and extracts a list of entries in the form "Movie Title (Year)". My HTML source looks like:

<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>

I'd like all the li tags following the first h3 tag and stopping at the next h2 tag, including all nested li tags.

firstH3 = soup.find('h3')

...correctly finds the place I'd like to start.

firstH3 = soup.find('h3')  # Start here
uls = []
for nextSibling in firstH3.findNextSiblings():
    if nextSibling.name == 'h2':
        break
    if nextSibling.name == 'ul':
        uls.append(nextSibling)

...gives me a list, uls, whose elements each contain the li items that I need.
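As a self-contained check of the sibling walk above, here is a minimal sketch. It assumes the bs4 package (Beautiful Soup 4) with Python's built-in html.parser, and uses the simplified HTML from the question plus one extra list after the h2 to show where collection stops. The function name collect_uls is hypothetical; find_next_siblings is the bs4 spelling of findNextSiblings.

```python
from bs4 import BeautifulSoup

# Simplified structure from the question, plus a list after the <h2>
# (an added assumption) that must NOT be collected.
SAMPLE_HTML = """
<h3>Header3 (Start here)</h3>
<ul>
    <li>List items</li>
    <li>Etc...</li>
</ul>
<h3>Header 3</h3>
<ul>
    <li>List items</li>
    <ul>
        <li>Nested list items</li>
        <li>Nested list items</li></ul>
    <li>List items</li>
</ul>
<h2>Header 2 (end here)</h2>
<ul>
    <li>Should not be collected</li>
</ul>
"""


def collect_uls(soup):
    """Collect the <ul> siblings after the first <h3>, stopping at the next <h2>."""
    first_h3 = soup.find("h3")
    uls = []
    for sibling in first_h3.find_next_siblings():
        if sibling.name == "h2":
            break  # end of the section of interest
        if sibling.name == "ul":
            uls.append(sibling)
    return uls


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
uls = collect_uls(soup)
li_count = sum(len(ul.find_all("li")) for ul in uls)
```

Only the top-level ul elements land in uls; the nested ul is reached later through its parent.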

Excerpt of the uls list:

<ul>
...
    <li><i><a href="/wiki/Agent_Cody_Banks" title="Agent Cody Banks">Agent Cody Banks</a></i> (2003)</li>
    <li><i><a href="/wiki/Agent_Cody_Banks_2:_Destination_London" title="Agent Cody Banks 2: Destination London">Agent Cody Banks 2: Destination London</a></i> (2004)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup" title="Air Bud: World Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Bud:_Seventh_Inning_Fetch" title="Air Bud: Seventh Inning Fetch">Air Bud: Seventh Inning Fetch</a></i> (2002)</li>
            <li><i><a href="/wiki/Air_Bud:_Spikes_Back" title="Air Bud: Spikes Back">Air Bud: Spikes Back</a></i> (2003)</li>
            <li><i><a href="/wiki/Air_Buddies" title="Air Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee" title="Akeelah and the Bee">Akeelah and the Bee</a></i> (2006)</li>
...
</ul>

But I'm unsure of where to go from here.

Update:

Final Code:

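lis = []
for ul in uls:
    for li in ul.findAll('li'):
        if li.find('ul'):
            continue  # skip parent items; their nested LIs are visited on their own
        lis.append(li)

for li in lis:
    print(li.text)

The if...continue skips any LI that contains a UL, since findAll('li') also returns the nested LIs on their own and the parent's text would duplicate them. (A break here would instead stop at the first such LI and drop every item after it in that UL.)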
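To see the effect of filtering out LIs that contain a nested UL, here is a minimal sketch, assuming bs4 and a trimmed version of the excerpt above (attributes shortened; the function name leaf_items is hypothetical). It skips each parent LI rather than leaving the loop, so items that come after a nested list are still collected.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the excerpt from the question (title attributes removed).
EXCERPT = """
<ul>
    <li><i><a href="/wiki/Agent_Cody_Banks">Agent Cody Banks</a></i> (2003)</li>
    <li>Air Bud series:
        <ul>
            <li><i><a href="/wiki/Air_Bud:_World_Pup">Air Bud: World Pup</a></i> (2000)</li>
            <li><i><a href="/wiki/Air_Buddies">Air Buddies</a></i> (2006)</li>
        </ul>
    </li>
    <li><i><a href="/wiki/Akeelah_and_the_Bee">Akeelah and the Bee</a></i> (2006)</li>
</ul>
"""


def leaf_items(ul):
    """Return the text of every <li> under ul that has no nested <ul>."""
    titles = []
    for li in ul.find_all("li"):
        if li.find("ul"):
            continue  # parent item: its nested <li>s are visited separately
        # Join the text pieces with a space so the title and year stay separated.
        titles.append(li.get_text(" ", strip=True))
    return titles


ul = BeautifulSoup(EXCERPT, "html.parser").find("ul")
titles = leaf_items(ul)
```

The "Air Bud series:" parent is dropped, while the two nested films and the film after the nested list are kept.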

Print output is now:

102 Dalmatians(2000)
10th & Wolf(2006)
11:14(2006)
12:08 East of Bucharest(2006)
13 Going on 30(2004)
1408(2007)
...

769

asked Dec 06 '10 03:12

danneu

2 Answers

.findAll() works for nested li elements:

for ul in uls:
    for li in ul.findAll('li'):
        print(li)

Output:

<li>List items</li>
<li>Etc...</li>
<li>List items</li>
<li>Nested list items</li>
<li>Nested list items</li>
<li>List items</li>
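Nested items are picked up because the search covers all descendants by default. In Beautiful Soup 4 the same method is spelled find_all, and passing recursive=False limits it to direct children. A minimal sketch, assuming bs4 and a made-up three-item list:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: item B holds a nested list with two entries.
html = "<ul><li>A</li><li>B<ul><li>B1</li><li>B2</li></ul></li><li>C</li></ul>"
outer_ul = BeautifulSoup(html, "html.parser").find("ul")

# Default search: every <li> descendant, in document order.
all_items = outer_ul.find_all("li")

# Direct children only: the nested B1 and B2 are not returned.
direct_items = outer_ul.find_all("li", recursive=False)

print(len(all_items), len(direct_items))
```

Note that the default search returns the parent item B as well as B1 and B2, which is why text taken from every result can contain duplicates.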

196

answered Oct 03 '22 02:10

jfs

A list comprehension could work, too.

lis = [li for ul in uls for li in ul.findAll('li')]
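If the parent items should be left out, the comprehension can carry a condition. This is a sketch under the assumption that bs4 is installed; the sample HTML and the variable titles are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two top-level lists, the first with a nested list.
html = """
<ul><li>One (2000)</li><li>Series:<ul><li>Two (2001)</li></ul></li></ul>
<ul><li>Three (2002)</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")
uls = soup.find_all("ul", recursive=False)  # top-level lists only

# Keep only <li>s that do not themselves contain a <ul>.
lis = [li for ul in uls for li in ul.find_all("li") if li.find("ul") is None]
titles = [li.get_text(strip=True) for li in lis]
```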

answered Oct 03 '22 01:10

zachwill

Tags: python, html, beautifulsoup, screen-scraping
