Get contents by class names using Beautiful Soup

Using the Beautiful Soup module, how can I get the data of a div tag whose class is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

or:

soup.find_all('class')

This is the HTML source:

<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
         <div class="feeditembody">
         <span>The actual data is some where here</span>
         </div>
     </div>
 </div> 

and this is the Python code:

 from BeautifulSoup import BeautifulSoup
 html_doc = open('home.jsp.html', 'r')

 soup = BeautifulSoup(html_doc)
 class="feeditemcontent cxfeeditemcontent"
Rajeev asked Jul 04 '12 14:07





3 Answers

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, meaning jadkik94's solution can be simplified:

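from bs4 import BeautifulSoup

def match_class(target):
    """Return a filter that accepts tags carrying every class in target."""
    def do_match(tag):
        # bs4 exposes "class" as a list of names; default to [] when absent.
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

# html holds the markup to parse, e.g. the contents of home.jsp.html from the question.
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))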
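
As a possible alternative to a custom filter function, a CSS selector can express "has both classes" directly. The following is a minimal sketch, assuming bs4 is installed and using the standard-library "html.parser" backend; the html string simply repeats the markup from the question, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = """<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
        <div class="feeditembody">
            <span>The actual data is some where here</span>
        </div>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# In bs4 the class attribute is exposed as a list of class names.
classes = soup.find("div")["class"]

# A compound class selector matches tags that carry both classes,
# whatever order they appear in within the attribute.
matches = soup.select("div.feeditemcontent.cxfeeditemcontent")
texts = [m.get_text(strip=True) for m in matches]

print(classes)  # ['feeditemcontent', 'cxfeeditemcontent']
print(texts)    # ['The actual data is some where here']
```

Whether select() or a filter function reads better is largely a matter of taste; the selector form is shorter when the condition is only about tag name and classes.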
Leonard Richardson answered Sep 21 '22 15:09



Try this. It may be overkill for such a simple task, but it works (it is written for Beautiful Soup 3, where the class attribute is a plain string):

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10
jadkik94 answered Sep 17 '22 15:09



soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

So, if I want to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would look something like this:

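from bs4 import BeautifulSoup as bs
import requests

url = "http://stackoverflow.com/"
html = requests.get(url).text
# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs(html, "html.parser")

# find_all is the bs4 spelling; findAll is kept only as a legacy alias.
tags = soup.find_all("div", class_="header")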

This is covered in the bs4 documentation, in the section on searching by CSS class.
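
One caveat worth illustrating: as far as the bs4 documentation describes it, passing a multi-class string to class_ matches the exact attribute value, so the order of the class names matters, while a single class name or a CSS selector does not depend on order. A small sketch, assuming bs4 is installed; the two-div html string here is made up for the demonstration and reuses the class names from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same two classes, written in different orders.
html = (
    '<div class="feeditemcontent cxfeeditemcontent">first</div>'
    '<div class="cxfeeditemcontent feeditemcontent">second</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Multi-class string: compared against the exact attribute value.
exact = [d.get_text() for d in soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")]

# Single class name: matches any tag that has that class among others.
single = [d.get_text() for d in soup.find_all("div", class_="feeditemcontent")]

# CSS selector: requires both classes, in any order.
selected = [d.get_text() for d in soup.select("div.feeditemcontent.cxfeeditemcontent")]

print(exact)
print(single)
print(selected)
```

If the order of class names in the page is not guaranteed, the selector form or a filter function is probably the safer choice.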

Aziz Alto answered Sep 19 '22 15:09
