Using the Beautiful Soup module, how can I get the data of a div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:
soup.class['feeditemcontent cxfeeditemcontent']
or:
soup.find_all('class')
This is the HTML source:
<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>
and this is the Python code:
from BeautifulSoup import BeautifulSoup
html_doc = open('home.jsp.html', 'r')
soup = BeautifulSoup(html_doc)
class="feeditemcontent cxfeeditemcontent"
Alternatively, you can find tags by class name using a CSS selector with Beautiful Soup's select() method. A class selector such as div.feeditemcontent also matches tags that carry additional CSS classes besides the one named in the selector.
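As a rough sketch of the select() approach, assuming Beautiful Soup 4 (the bs4 package) and the built-in "html.parser", the markup from the question could be queried like this. The variable names (outer, also_outer, text) are only illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Chained class selectors require both classes to be present, in any order.
outer = soup.select("div.feeditemcontent.cxfeeditemcontent")

# A single class selector matches the same tag, because the selector only
# requires that one class to be present among the tag's classes.
also_outer = soup.select("div.feeditemcontent")

# A descendant selector reaches the span that holds the text.
span = soup.select_one("div.feeditemcontent.cxfeeditemcontent span")
text = span.get_text(strip=True)
print(text)
```

select() returns a list of matching tags, while select_one() returns only the first match (or None when nothing matches).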
See also the documentation on formatters: you will most likely use either formatter="minimal" (the default) or formatter="html" (for HTML entities), unless you want to process the text manually. encode_contents returns an encoded bytestring; if you want a Python Unicode string, use decode_contents instead.
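A small sketch of the difference, assuming Beautiful Soup 4 on Python 3 with "html.parser". The sample markup here is made up for illustration and is not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing a non-ASCII character and an entity.
soup = BeautifulSoup("<div><span>caf\u00e9 &amp; tea</span></div>", "html.parser")
div = soup.div

# Inner markup as a Unicode string; formatter="minimal" is the default,
# so only characters such as & are escaped.
inner_minimal = div.decode_contents()

# formatter="html" additionally converts characters to named HTML
# entities where one exists.
inner_html = div.decode_contents(formatter="html")

# encode_contents returns bytes (UTF-8 unless another encoding is given).
inner_bytes = div.encode_contents()

print(inner_minimal)
print(inner_html)
print(type(inner_bytes))
```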
Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, meaning jadkik94's solution can be simplified:
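from bs4 import BeautifulSoup

def match_class(target):
    # Build a filter that accepts tags carrying every class in `target`.
    def do_match(tag):
        classes = tag.get('class', [])  # a list in Beautiful Soup 4
        return all(c in classes for c in target)
    return do_match

# `html` is the markup string from the question.
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))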
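To see the list-valued class attribute for yourself, something like the following should do, assuming Beautiful Soup 4 and "html.parser". The markup is a shortened version of the one in the question, and the names wanted and has_both are just for illustration.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's markup.
html = '<div class="feeditemcontent cxfeeditemcontent"><span>data</span></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")

# Beautiful Soup 4 splits the class attribute on whitespace into a list.
classes = div["class"]

# Tags without a class attribute have no "class" key, so .get() with a
# default avoids a KeyError.
span_classes = soup.span.get("class", [])

# Order-independent check that both classes are present.
wanted = {"feeditemcontent", "cxfeeditemcontent"}
has_both = wanted.issubset(classes)

print(classes, span_classes, has_both)
```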
Try this, maybe it's too much for this simple thing but it works:
# Beautiful Soup 3 (Python 2)
from BeautifulSoup import BeautifulSoup

def match_class(target):
    target = target.split()
    def do_match(tag):
        # In Beautiful Soup 3, tag.attrs is a list of (name, value) pairs
        # and the class value is a plain string.
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html)

# Tags that carry both classes.
matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

# Tags that carry a single class.
matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10
soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")
So, if I want to get all div tags with class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be something like:
from bs4 import BeautifulSoup as bs
import requests
url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html, "html.parser")
tags = soup.findAll("div", class_="header")
This is already covered in the bs4 documentation.
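Applied to the markup from the question, a minimal sketch might look like this, assuming Beautiful Soup 4 and "html.parser" (find_all is the bs4 spelling of findAll). Note that a multi-class string passed to class_ is compared against the exact attribute value, so the class order matters there. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Searching for one class matches tags that have it among their classes.
by_one_class = soup.find_all("div", class_="feeditemcontent")

# A string with several classes matches only the exact attribute value.
by_exact_string = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# The same classes in a different order do not match.
reordered = soup.find_all("div", class_="cxfeeditemcontent feeditemcontent")

# Pull the text out of every span inside the matched div.
texts = [span.get_text(strip=True)
         for div in by_one_class
         for span in div.find_all("span")]
print(texts)
```

If the class order in the page is not guaranteed, searching for a single class or using a CSS selector is probably the safer choice.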