Removing html image tags and everything in between from a string

Tags: python, html, regex, beautifulsoup

I've seen a number of questions about removing HTML tags from strings, but I'm still a bit unclear on how my specific case should be handled.

I've seen that many posts advise against using regular expressions to handle HTML, but I suspect my case may warrant judicious circumvention of this rule.

I'm trying to parse PDF files and I've successfully managed to convert each page from my sample PDF file into a string of UTF-32 text. When images appear, an HTML-style tag is inserted which contains the name and location of the image (which is saved elsewhere).

In a separate portion of my app, I need to get rid of these image tags. Because we're only dealing with image tags, I suspect the use of a regex may be warranted.

My question is twofold:

1. Should I use a regex to remove these tags, or should I still use an HTML parsing module such as BeautifulSoup?
2. Which regex or BeautifulSoup construct should I use? In other words, how should I code this?

For clarity, the tags are structured as <img src="/path/to/file"/>

Thanks!

asked May 07 '12 17:05

Louis Thibault

2 Answers

I would vote that in your case it is acceptable to use a regular expression. Something like this should work:

import re

def remove_html_tags(data):
    # Non-greedy match: removes every tag, not just <img>
    p = re.compile(r'<.*?>')
    return p.sub('', data)

I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)
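To illustrate how broad that generic pattern is, here is a small sketch using made-up sample strings (they are not from the question). It suggests the pattern removes every tag, and that it can also swallow ordinary text that happens to sit between a '<' and a later '>'.

```python
import re

def remove_html_tags(data):
    # Non-greedy: each match stops at the first '>' that follows a '<'.
    return re.sub(r'<.*?>', '', data)

# Made-up inputs for illustration only.
print(remove_html_tags('Some <b>bold</b> text <img src="/path/to/file"/>'))
print(remove_html_tags('if 1 < 2 and 3 > 2'))
```

In the second call, everything from the '<' through the '>' is treated as a tag, so text extracted from a PDF that contains comparison signs could lose content with this version.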

Edit: a version that removes only tags of the form <img ... />:

import re

def remove_img_tags(data):
    # Matches only tags written as <img ... /> on a single line
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
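The img-only pattern matches tags written in lowercase, on one line, and closed with "/>". A slightly more tolerant sketch follows, assuming the tags might vary in case, omit the closing slash, or wrap across lines (the question does not say they do). The name IMG_TAG is made up for illustration.

```python
import re

# Hypothetical pattern: '<img' followed by a word boundary, then anything up to the next '>'.
# [^>]* also matches newlines, so tags that wrap across lines are covered.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

def remove_img_tags(data):
    """Return data with every <img ...> tag removed."""
    return IMG_TAG.sub('', data)

# Made-up sample text for illustration only.
sample = 'Page one <img src="/path/to/file"/> text <IMG\nsrc="b.png"> end'
cleaned = remove_img_tags(sample)
print(cleaned)
```

This still breaks if an attribute value itself contains a '>' character. That seems unlikely for file paths, but it is the kind of case where a real parser is safer.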

answered Oct 13 '22 12:10

joshcartme

Since this text contains only image tags, it's probably OK to use a regex. For anything else, though, you're probably better off using a bona fide HTML parser. Fortunately, Python provides one! The example below is pretty bare-bones; to be fully functional, it would have to handle a lot more corner cases. (Most notably, XHTML-style empty tags, which end with a slash as in <... />, aren't handled correctly here.)

>>> from HTMLParser import HTMLParser
>>> 
>>> class TagDropper(HTMLParser):
...     def __init__(self, tags_to_drop, *args, **kwargs):
...         HTMLParser.__init__(self, *args, **kwargs)
...         self._text = []
...         self._tags_to_drop = set(tags_to_drop)
...     def clear_text(self):
...         self._text = []
...     def get_text(self):
...         return ''.join(self._text)
...     def handle_starttag(self, tag, attrs):
...         if tag not in self._tags_to_drop:
...             self._text.append(self.get_starttag_text())
...     def handle_endtag(self, tag):
...         self._text.append('</{0}>'.format(tag))
...     def handle_data(self, data):
...         self._text.append(data)
... 
>>> td = TagDropper([])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an <img url="foo"> tag
Another line of text with a <br> tag

And to drop img tags...

>>> td = TagDropper(['img'])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an  tag
Another line of text with a <br> tag
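The session above is Python 2. For Python 3, where the module is html.parser, a rough adaptation might look like the sketch below. It also overrides handle_startendtag so that XHTML-style tags such as <img src="/path/to/file"/> are dropped or kept as a unit. This is an assumption-laden sketch, not a complete HTML round-tripper: comments and declarations are silently discarded, and the default convert_charrefs=True means entities such as &amp; come out decoded.

```python
from html.parser import HTMLParser

class TagDropper(HTMLParser):
    """Re-emit the input text, leaving out the tags named in tags_to_drop."""

    def __init__(self, tags_to_drop):
        super().__init__()
        self._text = []
        self._tags_to_drop = set(tags_to_drop)

    def get_text(self):
        return ''.join(self._text)

    def handle_starttag(self, tag, attrs):
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <img src="..."/>
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self._tags_to_drop:
            self._text.append('</{0}>'.format(tag))

    def handle_data(self, data):
        self._text.append(data)

td = TagDropper(['img'])
td.feed('Text with an <img src="/path/to/file"/> tag\nand a <br> tag\n')
td.close()  # flush any text the parser is still buffering
print(td.get_text())
```

Passing an empty list instead of ['img'] should leave the tags in place, which is a quick way to check that the parser is passing content through as expected.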

answered Oct 13 '22 11:10

senderle
