How to select div by text content using Beautiful Soup?

Trying to scrape some HTML from something like this. Sometimes the data I need is in div[0], sometimes div[1], etc.

Imagine everyone takes 3-5 classes. One of them is always Biology. Their report card is always alphabetized. I want everybody's Biology grade.

I've already scraped all this HTML into a string; now how do I fish out the Biology grades?

<div class = "student">
    <div class = "score">Algebra C-</div>
    <div class = "score">Biology A+</div>
    <div class = "score">Chemistry B</div>
</div>
<div class = "student">
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry A</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
</div>
<div class = "student">
    <div class = "score">Algebra A</div>
    <div class = "score">Biology B</div>
    <div class = "score">Chemistry C+</div>
</div>
<div class = "student">
    <div class = "score">Alchemy D</div>
    <div class = "score">Algebra A</div>
    <div class = "score">Bangladeshi History C</div>
    <div class = "score">Biology B</div>
</div>

I'm using Beautiful Soup, and I think I'm going to have to find divs whose text includes "Biology"?

This is only for a quick scrape and I'm open to hard-coding and fiddling in Excel or whatnot. Yes, it's a shoddy website! Yes, they do have an API, and I don't know a thing about WSDL.

Short version: http://www.legis.ga.gov/Legislation/en-US/Search.aspx, to find the date of last action on every bill, FWIW. It's troublesome because if a bill has no sponsors in the second chamber, instead of a div containing nothing, they just don't have a div there at all. So sometimes the timeline is in div 3, sometimes 2, etc.

How do you find a specific text tag in BeautifulSoup?

To find elements that contain specific text in Beautiful Soup, you can pass a function, such as a lambda, to the find_all() method.
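A minimal sketch of the lambda approach, assuming a shortened copy of the sample markup from the question and the built-in html.parser backend. The variable names (biology_divs, grades) are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question.
html = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
    <div class="score">Chemistry B</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() calls the lambda once per tag and keeps the tags for which
# it returns True: here, score divs whose text starts with "Biology".
biology_divs = soup.find_all(
    lambda tag: tag.name == "div"
    and "score" in tag.get("class", [])
    and tag.get_text(strip=True).startswith("Biology")
)

# The grade is the last whitespace-separated token of each div's text.
grades = [div.get_text(strip=True).split()[-1] for div in biology_divs]
print(grades)
```

Checking the class inside the lambda keeps the outer student divs out of the result, since their combined text also contains "Biology".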

(1) To get only the Biology grades, it is almost a one-liner.

import bs4, re

# html holds the page source as a string
soup = bs4.BeautifulSoup(html, 'html.parser')
# string= (text= in versions before 4.4.0) matches text nodes against the regex
scores_string = soup.find_all(string=re.compile('Biology'))
# The grade is the last whitespace-separated token
scores = [score_string.split()[-1] for score_string in scores_string]
print(scores_string)
print(scores)

The output looks like this:

['Biology A+', 'Biology B', 'Biology B', 'Biology B', 'Biology B']
['A+', 'B', 'B', 'B', 'B']

(2) If you need the tags themselves for further processing, take the parent of each matched string:

import bs4, re

soup = bs4.BeautifulSoup(html, 'html.parser')
# Each match is a text node; .parent is the <div> that contains it
scores = soup.find_all(string=re.compile('Biology'))
divs = [score.parent for score in scores]
print(divs)

Output looks like this:

[<div class="score">Biology A+</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>, 
<div class="score">Biology B</div>]

In conclusion, you can use .parent, find_next_siblings(), find_previous_siblings(), etc. to move around the HTML tree.

See the "Navigating the tree" section of the Beautiful Soup documentation for more information. Good luck with your work.
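As a rough illustration of moving around the tree, the sketch below starts from the matched "Biology" text and walks to its parent, its enclosing student div, and its earlier siblings. It assumes one student from the sample markup and the html.parser backend; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
import re

# One student from the sample markup in the question.
html = """
<div class="student">
    <div class="score">Alchemy D</div>
    <div class="score">Algebra A</div>
    <div class="score">Biology B</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Start from the matching text node, then walk the tree from there.
biology_text = soup.find(string=re.compile("Biology"))
biology_div = biology_text.parent  # the <div class="score"> holding the text
student_div = biology_div.find_parent("div", class_="student")

# Score divs that come before Biology, returned nearest-first.
earlier = [d.get_text() for d in biology_div.find_previous_siblings("div")]
print(earlier)
```

This kind of relative navigation is what makes the varying div position harmless: the lookup is anchored on the text, not on an index.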

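Another way (using a CSS selector) is:

divs = soup.select('div.score:-soup-contains("Biology")')

The class filter matters: :-soup-contains() also looks at the text of descendants, so a bare div:-soup-contains("Biology") would match the enclosing student divs as well.

EDIT:

BeautifulSoup4 4.7.0+ (SoupSieve) is required. The :-soup-contains() spelling needs Soup Sieve 2.1 or later; earlier versions used :contains().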
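A minimal sketch comparing the selector with and without a class filter, assuming a shortened copy of the sample markup, the html.parser backend, and a Soup Sieve version recent enough to support :-soup-contains().

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question.
html = """
<div class="student">
    <div class="score">Algebra C-</div>
    <div class="score">Biology A+</div>
</div>
<div class="student">
    <div class="score">Biology B</div>
    <div class="score">Chemistry A</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Restricting the match to div.score returns only the grade divs.
divs = soup.select('div.score:-soup-contains("Biology")')
grades = [div.get_text(strip=True).split()[-1] for div in divs]
print(grades)

# A bare div selector also matches each student div, because the
# pseudo-class checks the text of descendants too.
all_matches = soup.select('div:-soup-contains("Biology")')
print(len(all_matches))
```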

You can extract them by searching for every <div> element whose class attribute is score, then using a regular expression to pull out the Biology grade:

from bs4 import BeautifulSoup
import sys
import re

# Read the HTML file named on the command line
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

for div in soup.find_all('div', class_='score'):
    # get_text() always returns a string; div.string is None if the div has child tags
    t = re.search(r'Biology\s+(\S+)', div.get_text())
    if t:
        print(t.group(1))

Run it like:

python3 script.py htmlfile

That yields:

A+
B
B
B
B
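To see what the regular expression does on its own, here is a small standard-library sketch run against the text of a few score divs. The samples list is hand-copied from the question's markup for illustration.

```python
import re

# Capture the first run of non-space characters after "Biology".
pattern = re.compile(r"Biology\s+(\S+)")

# Text of a few score divs from the sample markup.
samples = ["Algebra C-", "Biology A+", "Bangladeshi History C", "Biology B"]

# pattern.search returns None for non-matching text, so filter those out.
grades = [m.group(1) for m in map(pattern.search, samples) if m]
print(grades)  # ['A+', 'B']
```

Because \S+ stops at the first whitespace, grades such as "A+" or "C-" are captured whole, while subjects that merely start with "B" are skipped.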

Tags:

html

beautifulsoup

web-scraping

Maggie

3 Answers

B.Mr.W.

Anar Salimkhanov

Birei
