Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

parse embedded css beautifulsoup

Is it possible to extract the embedded css properties from an html tag? For instance, suppose I want to find out what the vertical-align attribute for "s5" is.

I'm currently using beautifulsoup and have retrieved the span-tag with tag=soup.find(class_="s5"). I've tried tag.attrs["class"] but that just gives me s5, with no way to link it to the embedded style. Is it possible to do this in python? Every question of this sort that I've found involves parsing inline css styles.

<html>
    <head>
        <style type="text/css">
        * {margin:0; padding:0; text-indent:0; }
        .s5 {color: #000; font-family:Verdana, sans-serif; 
             font-style: normal; font-weight: normal; 
             text-decoration: none; font-size: 17.5pt; 
             vertical-align: 10pt;}
        </style>
    </head>

    <body>
        <p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
        This is a sample sentence. <span class="s5"> 1</span>
        </p>
    </body>
</html>
like image 804
nodel Avatar asked Mar 04 '23 16:03

nodel


2 Answers

You can use a css parser like [cssutils][1]. I don't know if there is a function in the package itself to do something like this (can someone comment regarding this?), but i made a custom function to get it.

from bs4 import BeautifulSoup
import cssutils
html='''
<html>
    <head>
        <style type="text/css">
        * {margin:0; padding:0; text-indent:0; }
        .s5 {color: #000; font-family:Verdana, sans-serif;
             font-style: normal; font-weight: normal;
             text-decoration: none; font-size: 17.5pt;
             vertical-align: 10pt;}
        </style>
    </head>

    <body>
        <p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
        This is a sample sentence. <span class="s5"> 1</span>
        </p>
    </body>
</html>
'''
def get_property(class_name,property_name):
    for rule in sheet:
        if rule.selectorText=='.'+class_name:
            for property in rule.style:
                if property.name==property_name:
                    return property.value
soup=BeautifulSoup(html,'html.parser')
sheet=cssutils.parseString(soup.find('style').text)
vl=get_property('s5','vertical-align')
print(vl)

Output

10pt

This is not perfect but maybe you can improve upon it. [1]: https://pypi.org/project/cssutils/

like image 174
Bitto Bennichan Avatar answered Mar 12 '23 22:03

Bitto Bennichan


To improve upon the cssutils answer:

For an inline style="..." tag:

import cssutils

# get the style from beautiful soup, like: 
# style = tag['style']
style = "color: hotpink; background-color:#ff0000; visibility:hidden"

parsed_style = cssutils.parseStyle(style)

Now use parsed_style like you would an dict:

print(parsed_style['color'])  # hotpink
print(parsed_style['background-color'])  # f00
print(parsed_style['visibility'])  # hidden

like image 34
luckydonald Avatar answered Mar 12 '23 22:03

luckydonald