
BeautifulSoup: get css classes from html

Tags:

python

html

css

beautifulsoup

Is there a way to get CSS classes from an HTML file using BeautifulSoup? Example snippet:

<style type="text/css">
 p.c3 {text-align: justify}
 p.c2 {text-align: left}
 p.c1 {text-align: center}
</style>

Perfect output would be:

cssdict = {
    'p.c3': {'text-align': 'justify'},
    'p.c2': {'text-align': 'left'},
    'p.c1': {'text-align': 'center'}
}

although something like this would do:

L = [
    ('p.c3', {'text-align': 'justify'}),  
    ('p.c2', {'text-align': 'left'}),    
    ('p.c1', {'text-align': 'center'})
]

asked Jul 16 '12 09:07

root

2 Answers

BeautifulSoup itself doesn't parse CSS style declarations at all, but you can extract such sections then parse them with a dedicated CSS parser.

Depending on your needs, there are several CSS parsers available for Python. I'd pick cssutils (requires Python 2.5 or up, including Python 3): it is the most complete in its support, and it handles inline styles too.

Other options are css-py and tinycss.

To grab and parse all such style sections (example with cssutils):

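import cssutils
from bs4 import BeautifulSoup

# html is assumed to already hold the page source as a string
tree = BeautifulSoup(html, 'html.parser')
sheets = []
for styletag in tree.find_all('style', type='text/css'):
    if not styletag.string:  # empty tag, no inline CSS to parse
        continue
    # parseString handles a full stylesheet; parseStyle only takes bare declarations
    sheets.append(cssutils.parseString(styletag.string))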

With cssutils you can then combine these, resolve imports, and even have it fetch external stylesheets.

answered Oct 13 '22 02:10

Martijn Pieters

A BeautifulSoup & cssutils combo will do the trick nicely:

    from bs4 import BeautifulSoup as BSoup
    import cssutils

    selectors = {}
    # htmlfile is the path to the HTML document to inspect
    with open(htmlfile) as webpage:
        html = webpage.read()
        soup = BSoup(html, 'html.parser')
    for styles in soup.select('style'):
        # encode_contents() returns the tag's inner content as bytes
        css = cssutils.parseString(styles.encode_contents())
        for rule in css:
            # keep style rules only; comments, @media and @import are skipped
            if rule.type == rule.STYLE_RULE:
                style = rule.selectorText
                selectors[style] = {}
                for item in rule.style:
                    propertyname = item.name
                    value = item.value
                    selectors[style][propertyname] = value

BeautifulSoup finds all "style" tags in the HTML (head and body), .encode_contents() converts each tag's contents into bytes that cssutils can read, and cssutils then parses the individual CSS rules all the way down to the property/value level via rule.selectorText and rule.style.

Note: The "rule.STYLE_RULE" check keeps only style rules and skips everything else. The cssutils documentation details options for handling media rules, comments and imports.

It'd be cleaner if you broke this down into functions, but you get the gist...

answered Oct 13 '22 02:10

Cory Smith
