
Removing certain tags with BeautifulSoup and Python

Question

I am trying to remove styling tags such as <h2> and <div class=...> from an HTML page that I download and parse with BeautifulSoup. I want to keep what the tags contain (such as the text), but my attempt does not seem to work.

What I have tried

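import requests
from bs4 import BeautifulSoup

# urls and headers are assumed to be defined earlier in the script
for url in urls:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content, 'html.parser')
    table = soup.find("div", {"class": "product_specifications bottom_l js_readmore_content"})
    print("<hr style='border-width:5px;'>")
    # Attempted clean-up: look up 'style' elements and delete their style attribute
    for style in table.find_all('style'):
        if 'style' in style.attrs:
            del style.attrs['style']
    print(table)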

Questions I have already looked at

Python HTML parsing with beautiful soup and filtering stop words

Remove class attribute from HTML using Python and lxml

BeautifulSoup Tag Removal

user3671459 asked Oct 07 '14 09:10


2 Answers

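You can use decompose(), which removes a tag from the tree and destroys it together with its contents: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose

If you only want to remove a tag's contents, use clear(). If you want to take an element out of the tree but keep it for later use, use extract(). Both are described just above decompose() in the documentation.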
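As a rough illustration of how the three methods differ, here is a minimal sketch. The HTML snippet and the variable names are made up for the example and are not taken from the question's page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration
html = "<div><h2>Title</h2><p>Body <b>text</b></p></div>"

# decompose(): remove the <h2> and everything inside it
soup = BeautifulSoup(html, "html.parser")
soup.h2.decompose()
after_decompose = str(soup)

# clear(): keep the <p> tag itself but drop its contents
soup = BeautifulSoup(html, "html.parser")
soup.p.clear()
after_clear = str(soup)

# extract(): take the <b> tag out of the tree and keep a reference to it
soup = BeautifulSoup(html, "html.parser")
removed = soup.b.extract()
after_extract = str(soup)

print(after_decompose)
print(after_clear)
print(after_extract, "| removed:", removed)
```

Note that none of these keeps the text of the removed tag in place, so they fit best when the tag's contents are not needed in the output.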

m.wasowski answered Sep 17 '22 21:09


You are looking for unwrap(), which removes a tag but leaves its contents in place.

your_soup.tag.unwrap()
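Applied to the question, this could look roughly like the sketch below. The sample HTML and the helper name strip_tags are hypothetical, and the sketch assumes that both the style attributes and the <h2>/<div> wrappers should go while the text stays.

```python
from bs4 import BeautifulSoup

def strip_tags(html, names):
    """Hypothetical helper: drop style attributes and unwrap the named tags."""
    soup = BeautifulSoup(html, "html.parser")

    # Delete the style attribute from every tag that has one
    for tag in soup.find_all(style=True):
        del tag["style"]

    # Replace each matching tag with its own contents
    for tag in soup.find_all(names):
        tag.unwrap()

    return str(soup)

# Made-up sample markup for illustration only
html = '<div class="specs"><h2>Specs</h2><p style="color:red">Weight: 2 kg</p></div>'
result = strip_tags(html, ["h2", "div"])
print(result)
```

The call find_all('style') in the question matches <style> elements rather than tags carrying a style attribute, which is why this sketch filters with style=True instead.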

Bishwas Mishra answered Sep 18 '22 21:09