Removing html tags from a text using Regular Expression in python

1 Answer

Use BeautifulSoup. Use lxml. Do not use regular expressions to parse HTML.

Edit 2010-01-29: This would be a reasonable starting point for lxml:

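from lxml.html import fromstring
# With lxml 5.2 and later the cleaner is distributed separately:
# install the lxml_html_clean package for this import to work.
from lxml.html.clean import Cleaner
import requests

url = "https://stackoverflow.com/questions/2165943/removing-html-tags-from-a-text-using-regular-expression-in-python"
html = requests.get(url).text

doc = fromstring(html)

# Tags to unwrap: the tag itself is dropped, its text and children are kept.
tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6',
        'div', 'span',
        'img', 'area', 'map']
# Strip <script>, <style> and <link> elements; leave meta tags,
# attributes and page structure alone.
args = {'meta': False, 'safe_attrs_only': False, 'page_structure': False,
        'scripts': True, 'style': True, 'links': True, 'remove_tags': tags}
cleaner = Cleaner(**args)

# Keep only the body; the head is not content.
path = '/html/body'
body = doc.xpath(path)[0]

# Drop non-ASCII characters so the result prints on any terminal.
print(cleaner.clean_html(body).text_content().encode('ascii', 'ignore').decode('ascii'))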

You want the content, so presumably you don't want any JavaScript or CSS. Presumably you also want only what is in the body, not the HTML from the head. Read up on lxml.html.clean to see what else you can easily strip out. Way smarter than regular expressions, no?
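
To see why, here is a small sketch that compares a naive regex with a real parser. It uses the standard library's html.parser instead of lxml or BeautifulSoup, purely so that it runs without extra installs. The names TextExtractor and strip_tags and the sample markup are made up for illustration and are not part of the answer above.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML fragment, whitespace-normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical sample: a '>' inside an attribute value, plus inline script.
sample = '<p title="a > b">Hello <b>world</b></p><script>var x = "<i>";</script>'

naive = re.sub(r"<[^>]+>", "", sample)   # the regex approach
parsed = strip_tags(sample)              # the parser approach

print(repr(naive))
print(repr(parsed))
```

The regex treats the first ">" it meets as the end of the tag, so part of the attribute value leaks into the output, and the script source survives as if it were content. The parser knows where attribute values and script elements begin and end, so only the visible text is left.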

Also, watch out for Unicode encoding problems. You can easily end up with text containing characters that your output encoding cannot represent, and printing it then fails.
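
As a rough illustration of the options, the sketch below pushes a short string with non-ASCII characters through three of Python's standard encoding error handlers. The sample string is invented, and ASCII is only an assumed worst-case target; where the output stream accepts UTF-8, none of this is needed.

```python
# Hypothetical extracted text containing accented letters and curly quotes.
text = "Caf\u00e9 \u201cna\u00efve\u201d"

# 'ignore' silently drops anything that does not fit the target encoding.
dropped = text.encode("ascii", "ignore").decode("ascii")

# 'replace' keeps a visible marker where a character was lost.
replaced = text.encode("ascii", "replace").decode("ascii")

# 'backslashreplace' keeps the information as escape sequences.
escaped = text.encode("ascii", "backslashreplace").decode("ascii")

print(dropped)
print(replaced)
print(escaped)
```

Which handler is appropriate depends on whether losing characters silently is acceptable; 'ignore' is the quietest choice but also the one that hides the most.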

2012-11-08: changed from using urllib2 to requests. Just use requests!

109

answered Oct 06 '22 01:10

hughdbrown

Tags: python, html, regex, tags

Dan
