I know what I'm trying to do is simple, but it's causing me grief. I'd like to pull data from HTML using BeautifulSoup. To do that I need to properly use the .find() function. Here's the HTML I'm working with:
<div class="audit">
<div class="profile-info">
<img class="profile-pic" src="https://pbs.twimg.com/profile_images/471758097036226560/tLLeiOiL_normal.jpeg" />
<h4>Ed Boon</h4>
<span class="screen-name"><a href="http://www.twitter.com/noobde" target="_blank">@noobde</a></span>
</div>
<div class="followers">
<div class="pie"></div>
<div class="pie-data">
<span class="real number" data-value=73599>73,599</span><span class="real"> Real</span><br />
<span class="fake number" data-value=32452>32,452</span><span class="fake"> Fake</span><br />
<h6>Followers</h6>
</div>
</div>
<div class="score">
<img src="//twitteraudit-prod.s3.amazonaws.com/dist/f977287de6281fe3e1ef36d48d996fb83dd6a876/img/audit-result-good.png" />
<div class="percentage good">
69%
</div>
<h6>Audit score</h6>
The values I want are 73599 from data-value=73599, 32452 from data-value=32452, and the 69% from percentage good.
Using past code and online examples, this is what I have so far:
RealValue = soup.find("div", {"class":"real number"})['data-value']
FakeValue = soup.find("audit", {"class":"fake number"})['data-value']
Both so far to no effect. I'm not sure how to craft the find in order to pull the 69% number.
find() method: find() returns the first tag that matches the given name and attributes as a bs4.element.Tag object, or None if nothing matches.
Beautiful Soup is a Python library for web scraping that pulls data out of HTML and XML files. It builds a parse tree from the page source, which can then be used to extract data hierarchically and more readably.
The most commonly used methods for searching the parse tree are find() and find_all().
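As a minimal sketch of the difference between the two methods, assuming a made-up snippet of markup (the `html` string below is hypothetical and not taken from the question):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only to illustrate the two calls.
html = "<p id='first'>one</p><p>two</p><div>three</div>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")      # first matching Tag only
all_p = soup.find_all("p")    # list-like ResultSet with every match
missing = soup.find("table")  # None when nothing matches

print(first_p.get_text())
print(len(all_p))
print(missing)
```

Because find() returns None on a miss, indexing the result directly (for example with ['data-value']) raises a TypeError when the tag name or class does not match anything.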
soup.find("div", {"class":"real number"})['data-value']
Here you are searching for a div element, but in your example HTML it is the span that has the "real number" class. Try instead:
soup.find("span", {"class": "real number", "data-value": True})['data-value']
Here we are also checking for the presence of the data-value attribute.
To find elements having "real number" or "fake number" classes, you can use a CSS selector:
for elm in soup.select(".real.number,.fake.number"):
    print(elm.get("data-value"))
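Put together, a runnable sketch of the selector approach might look like the following. It assumes the followers markup from the question is held in a string and parsed with the built-in html.parser; the `counts` dictionary and the int() conversion are illustrative additions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The "pie-data" block copied from the question's HTML.
html = """
<div class="pie-data">
<span class="real number" data-value=73599>73,599</span><span class="real"> Real</span><br />
<span class="fake number" data-value=32452>32,452</span><span class="fake"> Fake</span><br />
<h6>Followers</h6>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

counts = {}
for elm in soup.select("span.real.number, span.fake.number"):
    # elm["class"] is a list such as ["real", "number"]; use the first entry as the key.
    label = elm["class"][0]
    counts[label] = int(elm["data-value"])

print(counts)
```

Reading the data-value attribute rather than the visible text avoids having to strip the thousands separator from "73,599".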
To get the 69% value:
soup.find("div", {"class": "percentage good"}).get_text(strip=True)
Or, a CSS selector:
soup.select_one(".percentage.good").get_text(strip=True)
soup.select_one(".score .percentage").get_text(strip=True)
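Or, locating the h6 element having the Audit score text and then getting the preceding sibling tag (find_previous_sibling() skips the whitespace text node that sits between the two tags, which a plain previous_sibling would return):
soup.find("h6", string="Audit score").find_previous_sibling("div").get_text(strip=True)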
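If you need the score as a number rather than the string "69%", one possible sketch is below. It assumes the score markup from the question (with the closing div added so the snippet is complete) and guards against the element being absent; the `score` variable is just an illustrative name.

```python
from bs4 import BeautifulSoup

# The "score" block from the question, with the closing </div> added.
html = """
<div class="score">
<div class="percentage good">
69%
</div>
<h6>Audit score</h6>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

node = soup.select_one(".score .percentage")
# select_one() returns None on a miss, so check before calling get_text().
score = int(node.get_text(strip=True).rstrip("%")) if node is not None else None

print(score)
```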