
BeautifulSoup: get css classes from html

Is there a way to get CSS classes from an HTML file using BeautifulSoup? Example snippet:

<style type="text/css">
 p.c3 {text-align: justify}
 p.c2 {text-align: left}
 p.c1 {text-align: center}
</style>

Perfect output would be:

cssdict = {
    'p.c3': {'text-align': 'justify'},
    'p.c2': {'text-align': 'left'},
    'p.c1': {'text-align': 'center'}
}

although something like this would do:

L = [
    ('p.c3', {'text-align': 'justify'}),  
    ('p.c2', {'text-align': 'left'}),    
    ('p.c1', {'text-align': 'center'})
]

asked Jul 16 '12 09:07 by root



2 Answers

BeautifulSoup itself doesn't parse CSS style declarations at all, but you can extract such sections then parse them with a dedicated CSS parser.

Depending on your needs, there are several CSS parsers available for Python. I'd pick cssutils (requires Python 2.5 or later, including Python 3); it is the most complete in its support, and it supports inline styles too.

Other options are css-py and tinycss.

To grab and parse all such style sections (example with cssutils):

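import cssutils

# `tree` is an already-parsed BeautifulSoup document
sheets = []
for styletag in tree.find_all('style', type='text/css'):
    if not styletag.string:  # empty tag, nothing to parse
        continue
    # parseString handles a whole stylesheet; parseStyle is only for
    # a bare declaration block such as an inline style attribute
    sheets.append(cssutils.parseString(styletag.string))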

With cssutils you can then combine these, resolve imports, and even have it fetch external stylesheets.

answered Oct 13 '22 02:10 by Martijn Pieters


A BeautifulSoup & cssutils combo will do the trick nicely:

    from bs4 import BeautifulSoup as BSoup
    import cssutils

    selectors = {}
    # htmlfile is the path to the HTML document
    with open(htmlfile) as webpage:
        html = webpage.read()
        soup = BSoup(html, 'html.parser')
    # Parse the contents of every <style> tag as a stylesheet
    for styles in soup.select('style'):
        css = cssutils.parseString(styles.encode_contents())
        for rule in css:
            # Keep only style rules (skip comments, @media, @import, ...)
            if rule.type == rule.STYLE_RULE:
                style = rule.selectorText
                selectors[style] = {}
                # Each item is one property/value declaration
                for item in rule.style:
                    propertyname = item.name
                    value = item.value
                    selectors[style][propertyname] = value

BeautifulSoup finds all "style" tags in the HTML (head & body), .encode_contents() returns each tag's contents as bytes that cssutils can read, and cssutils then parses the individual CSS rules all the way down to the property/value level via rule.selectorText & rule.style.

Note: The "rule.STYLE_RULE" check keeps only style rules and skips everything else. The cssutils documentation details options for handling media rules, comments and imports.

It'd be cleaner if you broke this down into functions, but you get the gist...

answered Oct 13 '22 02:10 by Cory Smith