
How do you find all list items between two tags with BeautifulSoup?

For example, given the HTML below, I'd like to extract only Child1, Child2, and Child3: the list items that come after the first h3 tag and before the next h3 tag.

<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
Chris asked Jan 29 '14 04:01



3 Answers

Using findChildren, you can take the first three li items of each ul:

for ul in soup.find_all('ul'):
    print('ul start')
    # Keep only the first three li items of this list
    for idx, li in enumerate(ul.findChildren('li')):
        if idx < 3:
            print(li)

Output:

ul start
<li>Parent</li>
<li class="lev1">Child1</li>
<li class="lev1">Child2</li>
ul start
<li>Parent2</li>
<li class="lev1">Child4</li>
<li class="lev1">Child5</li>

However, lxml with XPath is often a more concise option:

from lxml import html
doc = html.parse('input.html')
# For each ul, select its first three li children
print([ul.xpath('li[1] | li[2] | li[3]') for ul in doc.xpath('//ul')])
Guy Gavriely answered Nov 11 '22 14:11


This should work. It cuts out the markup between the first <h3> and the next one with a regex, then parses only that slice:

import re
from bs4 import BeautifulSoup
html_doc = '<h3>HeaderName1</h3><ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li><li class="lev1">Child2</li><li class="lev1">Child3</li></ul>  <h3>HeaderName2</h3><ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li><li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'
# Non-greedy match from the first <h3> up to the start of the next <h3>
m = re.search(r'<h3>.*?<h3>', html_doc, re.DOTALL)
s = m.start()
e = m.end() - len('<h3>')  # drop the trailing <h3> that ended the match
target_html = html_doc[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')
ul_eles = new_bs.find_all('ul', attrs={'class': 'prodoplist'})
for ul_ele in ul_eles:
    # Search inside this ul rather than the whole slice
    li_eles = ul_ele.find_all('li', attrs={'class': 'lev1'})
    for li_ele in li_eles:
        print(li_ele.text)
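
The same slice can probably be taken without a regex by walking the siblings of the first h3 until the next h3 appears. This is only a minimal sketch, assuming bs4 is installed and that the h3 and ul tags are siblings as in the question's sample; items_after_header is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html_doc = """
<h3>HeaderName1</h3>
<ul class="prodoplist">
  <li>Parent</li>
  <li class="lev1">Child1</li>
  <li class="lev1">Child2</li>
  <li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
  <li>Parent2</li>
  <li class="lev1">Child4</li>
  <li class="lev1">Child5</li>
  <li class="lev1">Child6</li>
</ul>
"""


def items_after_header(soup, header_text):
    """Hypothetical helper: li.lev1 texts between the named h3 and the next h3."""
    header = soup.find("h3", string=header_text)
    if header is None:
        return []
    found = []
    # Only h3 and ul siblings are returned, in document order
    for sibling in header.find_next_siblings(["h3", "ul"]):
        if sibling.name == "h3":
            break  # reached the next header, so stop collecting
        for li in sibling.find_all("li", class_="lev1"):
            found.append(li.get_text(strip=True))
    return found


soup = BeautifulSoup(html_doc, "html.parser")
children = items_after_header(soup, "HeaderName1")
print(children)
```

Stopping at the next h3 is what limits the result to the first section; if the lists were nested inside other containers, the sibling walk would need adjusting.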
Priyank Patel answered Nov 11 '22 16:11


import requests
from bs4 import BeautifulSoup

children = []

url = "http://someurl.html"
r = requests.get(url)
bs = BeautifulSoup(r.text, "html.parser")
# Collect the text of every li.lev1 inside each ul.prodoplist.
# Note: this gathers items from every list, not only the first header's.
for ul in bs.find_all('ul', 'prodoplist'):
    for li in ul.find_all('li', 'lev1'):
        children.append(li.text)

print(children)
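
Since the loop above gathers the lev1 items of every list, one way to keep them apart is to group each list under the h3 that precedes it and then pick the header you want. A rough sketch, assuming bs4 is installed and using an inline sample string in place of the requests call; group_children_by_header is a hypothetical name introduced here for illustration.

```python
from bs4 import BeautifulSoup

sample = (
    '<h3>HeaderName1</h3>'
    '<ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li>'
    '<li class="lev1">Child2</li><li class="lev1">Child3</li></ul>'
    '<h3>HeaderName2</h3>'
    '<ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li>'
    '<li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'
)


def group_children_by_header(markup):
    """Hypothetical helper: map each h3 text to the lev1 items of the lists after it."""
    soup = BeautifulSoup(markup, "html.parser")
    grouped = {}
    for ul in soup.find_all("ul", class_="prodoplist"):
        # Nearest h3 that comes before this list at the same nesting level
        header = ul.find_previous_sibling("h3")
        key = header.get_text(strip=True) if header is not None else None
        items = [li.get_text(strip=True) for li in ul.find_all("li", class_="lev1")]
        grouped.setdefault(key, []).extend(items)
    return grouped


grouped = group_children_by_header(sample)
print(grouped["HeaderName1"])
```

Lists with no preceding h3 end up under the key None in this sketch, which may or may not suit a real page.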
o-90 answered Nov 11 '22 14:11