For example, I'd like to pull only Child1, Child2, and Child3 out of the list below, i.e. the items that come after the first h3 tag and before the next h3 tag.
<h3>HeaderName1</h3>
<ul class="prodoplist">
<li>Parent</li>
<li class="lev1">Child1</li>
<li class="lev1">Child2</li>
<li class="lev1">Child3</li>
</ul>
<h3>HeaderName2</h3>
<ul class="prodoplist">
<li>Parent2</li>
<li class="lev1">Child4</li>
<li class="lev1">Child5</li>
<li class="lev1">Child6</li>
</ul>
To match several tags or attributes at once, pass a list or a dictionary to find() or find_all(): a list of tag names matches any of them, and a dictionary filters on attribute values. Both methods are provided by the Beautiful Soup library for retrieving data by tag or attribute.
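As a rough illustration of the list and dictionary forms, here is a minimal sketch. It assumes the bs4 package is installed and uses a shortened copy of the HTML from the question; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: closing tags are well-formed)
html_doc = """
<h3>HeaderName1</h3>
<ul class="prodoplist">
<li>Parent</li>
<li class="lev1">Child1</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A list of names matches any of the listed tags, in document order
tags = soup.find_all(["h3", "li"])
names_and_text = [(tag.name, tag.get_text()) for tag in tags]
print(names_and_text)

# A dictionary passed as attrs filters on attribute values
lev1 = soup.find_all("li", attrs={"class": "lev1"})
print([li.get_text() for li in lev1])
```

The list form is handy when the order of mixed tags matters, since the results come back in document order.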
find() method: find() returns the first tag that matches the given name or attributes as a bs4.element.Tag object, or None if nothing matches.
To print all the heading tags with BeautifulSoup, use the find_all() method, one of the most commonly used methods in the library. It searches the descendants of a tag and returns every occurrence that matches.
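The difference between find() and find_all() can be seen in a small sketch like the one below. It assumes bs4 is installed; the sample markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two headings
html_doc = "<h3>HeaderName1</h3><p>intro</p><h3>HeaderName2</h3>"
soup = BeautifulSoup(html_doc, "html.parser")

first_heading = soup.find("h3")     # first match only
all_headings = soup.find_all("h3")  # every match, in document order
missing = soup.find("h4")           # no <h4> in the sample, so this is None

print(first_heading.get_text())
print([h.get_text() for h in all_headings])
print(missing)
```

Because find() can return None, it is worth checking the result before calling methods on it.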
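Using findChildren, like:

    # soup is a BeautifulSoup object parsed from the HTML above
    for ul in soup.find_all('ul'):
        print('ul start')
        # keep only the first three <li> children of each list
        for idx, li in enumerate(ul.findChildren('li')):
            if idx < 3:
                print(li)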
Output:
ul start
<li>Parent</li>
<li class="lev1">Child1</li>
<li class="lev1">Child2</li>
ul start
<li>Parent2</li>
<li class="lev1">Child4</li>
<li class="lev1">Child5</li>
However, as in most cases, lxml with XPath is a superior solution:

    from lxml import html

    doc = html.parse('input.html')
    # for every <ul>, select its first three <li> children
    print([ul.xpath('li[1] | li[2] | li[3]') for ul in doc.xpath('//ul')])
This should work.

    import re
    from bs4 import BeautifulSoup

    html_doc = '<h3>HeaderName1</h3><ul class="prodoplist"><li>Parent</li><li class="lev1">Child1</li><li class="lev1">Child2</li><li class="lev1">Child3</li></ul> <h3>HeaderName2</h3><ul class="prodoplist"><li>Parent2</li><li class="lev1">Child4</li><li class="lev1">Child5</li><li class="lev1">Child6</li></ul>'
    # match from the first <h3> up to the opening of the next <h3>
    m = re.search(r'<h3>.*?<h3>', html_doc, re.DOTALL)
    s = m.start()
    e = m.end() - len('<h3>')  # drop the trailing '<h3>' of the next section
    target_html = html_doc[s:e]
    new_bs = BeautifulSoup(target_html, 'html.parser')
    ul_eles = new_bs.find_all('ul', attrs={'class': 'prodoplist'})
    for ul_ele in ul_eles:
        # search within the current <ul> rather than the whole fragment
        li_eles = ul_ele.find_all('li', attrs={'class': 'lev1'})
        for li_ele in li_eles:
            print(li_ele.text)
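
The same "between the first h3 and the next h3" slice could probably also be taken from the parse tree itself, without a regex. The following is only a sketch under a few assumptions: bs4 is installed, the h3 tags are properly closed, and the h3 and ul elements are siblings as in the sample. The markup is a shortened copy of the question's HTML.

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: <h3> and <ul> are siblings)
html_doc = (
    '<h3>HeaderName1</h3>'
    '<ul class="prodoplist"><li>Parent</li>'
    '<li class="lev1">Child1</li><li class="lev1">Child2</li>'
    '<li class="lev1">Child3</li></ul>'
    '<h3>HeaderName2</h3>'
    '<ul class="prodoplist"><li>Parent2</li>'
    '<li class="lev1">Child4</li></ul>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first_h3 = soup.find("h3")
children = []
# Walk the tags that follow the first <h3> and stop at the next <h3>
for sibling in first_h3.find_next_siblings():
    if sibling.name == "h3":
        break
    if sibling.name == "ul":
        children.extend(
            li.get_text() for li in sibling.find_all("li", class_="lev1")
        )
print(children)
```

Stopping at the next h3 keeps the result limited to the first section even if more lists follow later in the page.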

    import requests
    from bs4 import BeautifulSoup

    children = []
    url = "http://someurl.html"
    r = requests.get(url)
    bs = BeautifulSoup(r.text, 'html.parser')
    # collect the text of every <li class="lev1"> inside each <ul class="prodoplist">
    for uls in bs.find_all('ul', class_='prodoplist'):
        lis = uls.find_all('li', class_='lev1')
        for li in lis:
            children.append(li.text)
    print(children)