
Find div text through div label with beautifulsoup

Given the following HTML snippet, I would like to extract the values corresponding to the labels 'Price' and 'Ships from':

<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>

This snippet is part of a larger HTML file. In some files the 'Ships from' label is present, in others it is not. I would like to use BeautifulSoup, or a similar approach, to deal with this variability in the HTML content. The file contains many div and span elements, which makes it hard to select the right ones without an id or class name.

My idea was something like this:

t = open('snippet.html', 'rb').read().decode('iso-8859-1')
s = BeautifulSoup(t, 'lxml')
s.find('div.divName[label*=Price]')
s.find('div.divName[label*=Ships from]')

However, this returns an empty list.

Jeroen asked May 22 '19 08:05




3 Answers

Use select to find each label, then read the value from the element that follows it with find_next_sibling().text.

Example:

from bs4 import BeautifulSoup

html = """<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
for lab in soup.select("label"):
    print(lab.find_next_sibling().text)

Output:

22.99
EU
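
The loop above prints the value after every label. For the case in the question, where 'Ships from' may be missing, a minimal sketch of a lookup by label text could look like this. The helper name value_for_label is hypothetical and not part of BeautifulSoup, and the snippet here deliberately omits the 'Ships from' block:

```python
from bs4 import BeautifulSoup

# Same structure as the question, but without the optional 'Ships from' block
html = """<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
</div>"""


def value_for_label(soup, label_text):
    """Hypothetical helper: text of the element after a matching <label>, or None."""
    for label in soup.find_all("label"):
        # Compare the stripped label text exactly
        if label.get_text(strip=True) == label_text:
            sibling = label.find_next_sibling()
            if sibling is not None:
                return sibling.get_text(strip=True)
    # Label absent, or it has no following element
    return None


soup = BeautifulSoup(html, "html.parser")
print(value_for_label(soup, "Price"))
print(value_for_label(soup, "Ships from"))
```

Returning None for a missing label lets the caller decide how to treat files that have no 'Ships from' entry.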
Rakesh answered Oct 22 '22 15:10



You can use :contains (with bs4 4.7.1+) and next_sibling:

from bs4 import BeautifulSoup as bs

html = '''
<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>
'''

soup = bs(html, 'lxml')
items = soup.select('label:contains(Price), label:contains("Ships from")')

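for item in items:
    # The first next_sibling is the whitespace text node after </label>;
    # the second is the element that holds the value
    print(item.text, item.next_sibling.next_sibling.text)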
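
As far as I know, newer releases of Soup Sieve (the selector engine behind select) deprecate :contains in favour of :-soup-contains. The sketch below assumes soupsieve 2.1 or later and uses find_next_sibling() so it does not depend on whitespace nodes. It also shows why the attempt in the question fails: find() does not take CSS selectors, and [label*=Price] tests an attribute named label, not a child label element.

```python
from bs4 import BeautifulSoup

html = """
<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: soupsieve >= 2.1, where :-soup-contains() is available
selector = (
    'div.divName label:-soup-contains("Price"), '
    'div.divName label:-soup-contains("Ships from")'
)

pairs = []
for label in soup.select(selector):
    # find_next_sibling() skips text nodes and returns the next element
    value = label.find_next_sibling()
    if value is not None:
        pairs.append((label.get_text(strip=True), value.get_text(strip=True)))

print(pairs)
```

If a label is absent from a file, select simply returns fewer matches, so no extra handling is needed for the missing case.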
QHarr answered Oct 22 '22 13:10



Try this:

from bs4 import BeautifulSoup
from bs4.element import Tag

html = """ <div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>"""

s = BeautifulSoup(html, 'lxml')
row = s.find(class_='divName')

Solution-1:

for tag in row.find_all(True):
    # Skip wrapper elements that hold more than one child node
    if len(tag.contents) > 1:
        continue
    # Compare names exactly; `tag.name in 'span'` would be a substring test
    if isinstance(tag, Tag) and tag.name in ('span', 'div'):
        print(tag.text)

Solution-2:

for lab in row.select("label"):
    print(lab.find_next_sibling().text)

Output:

22.99
EU
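
Building on Solution-2, one possible variation collects every label and its value into a dict, so that an optional label can be read with a default. This is only a sketch; the names values and the "n/a" default are illustrative choices, not something from the answers here.

```python
from bs4 import BeautifulSoup

html = """<div class="divName">
    <div>
        <label>Price</label>
        <div>22.99</div>
    </div>
    <div>
        <label>Ships from</label>
        <span>EU</span>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find(class_="divName")

# Map each label's text to the text of the element that follows it
values = {}
for lab in row.select("label"):
    sibling = lab.find_next_sibling()
    if sibling is not None:
        values[lab.get_text(strip=True)] = sibling.get_text(strip=True)

print(values.get("Price"))
# dict.get supplies a default when a label is missing from the file
print(values.get("Ships from", "n/a"))
```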
bharatk answered Oct 22 '22 15:10
