Assume the following HTML snippet, from which I would like to extract the values corresponding to the labels 'Price' and 'Ships from':
<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>
The snippet is part of a larger HTML file. In some files the 'Ships from' label is present, in others it is not. I would like to use BeautifulSoup, or a similar approach, to deal with this, because of the variability of the HTML content. Multiple div and span elements are present, which makes it hard to select without an id or class name.
My thought was something like this:
t = open('snippet.html', 'rb').read().decode('iso-8859-1')
s = BeautifulSoup(t, 'lxml')
s.find('div.divName[label*=Price]')
s.find('div.divName[label*=Ships from]')
However, this returns no results.
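(For context: `find()` interprets its first argument as a tag name, not a CSS selector, so the calls above match nothing; and even with `select()`, the `[label*=...]` syntax matches attributes, not child elements. A minimal sketch of locating the value by the label's text instead, with the snippet inlined rather than read from snippet.html:)

```python
from bs4 import BeautifulSoup

html = """<div class="divName">
<div><label>Price</label><div>22.99</div></div>
<div><label>Ships from</label><span>EU</span></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# find() matches tag names, so locate the <label> by its text,
# then take the element that immediately follows it
price_label = soup.find("label", string="Price")
print(price_label.find_next_sibling().text)  # 22.99
```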
Use select to find the label elements and then use find_next_sibling().text.
Ex:
from bs4 import BeautifulSoup
html = """<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for lab in soup.select("label"):
    print(lab.find_next_sibling().text)
Output:
22.99
EU
You can use :contains (with bs4 4.7.1+; newer soupsieve versions spell it :-soup-contains) and next_sibling:
from bs4 import BeautifulSoup as bs
html = '''
<div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = soup.select('label:contains(Price), label:contains("Ships from")')
for item in items:
    # the first next_sibling is the whitespace text node after the label
    print(item.text, item.next_sibling.next_sibling.text)
Try this:
from bs4 import BeautifulSoup
from bs4.element import Tag
html = """ <div class="divName">
<div>
<label>Price</label>
<div>22.99</div>
</div>
<div>
<label>Ships from</label>
<span>EU</span>
</div>
</div>"""
s = BeautifulSoup(html, 'lxml')
row = s.find(class_='divName')
Solution 1:
for tag in row.findChildren():
    if len(tag) > 1:  # skip wrapper tags with more than one child
        continue
    if isinstance(tag, Tag) and tag.name == 'span':
        print(tag.text)
    elif isinstance(tag, Tag) and tag.name == 'div':
        print(tag.text)
Solution 2:
for lab in row.select("label"):
    print(lab.find_next_sibling().text)
Output:
22.99
EU
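None of the answers above covers the asker's case where the 'Ships from' label is sometimes absent. A small helper (the name `value_for_label` is illustrative, not from any answer) that returns None when a label is missing:

```python
from bs4 import BeautifulSoup

def value_for_label(soup, label_text):
    """Return the text of the element following the given <label>, or None."""
    label = soup.find("label", string=label_text)
    if label is None:
        return None
    sibling = label.find_next_sibling()
    return sibling.text if sibling is not None else None

# note: no 'Ships from' block in this document
html = """<div class="divName">
<div><label>Price</label><div>22.99</div></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
print(value_for_label(soup, "Price"))       # 22.99
print(value_for_label(soup, "Ships from"))  # None
```

This way each field lookup degrades gracefully instead of raising AttributeError on the missing label.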