
Removing HTML image tags and everything in between from a string

I've seen a number of questions about removing HTML tags from strings, but I'm still a bit unclear on how my specific case should be handled.

I've seen that many posts advise against using regular expressions to handle HTML, but I suspect my case may warrant judicious circumvention of this rule.

I'm trying to parse PDF files and I've successfully managed to convert each page from my sample PDF file into a string of UTF-32 text. When images appear, an HTML-style tag is inserted which contains the name and location of the image (which is saved elsewhere).

In a separate portion of my app, I need to get rid of these image tags. Because we're only dealing with image tags, I suspect the use of a regex may be warranted.

My question is twofold:

  1. Should I use a regex to remove these tags, or should I still use an HTML parsing module such as BeautifulSoup?
  2. Which regex or BeautifulSoup construct should I use? In other words, how should I code this?

For clarity, the tags are structured as <img src="/path/to/file"/>

Thanks!

Louis Thibault asked May 07 '12 17:05



2 Answers

I would vote that in your case it is acceptable to use a regular expression. Something like this should work:

import re

def remove_html_tags(data):
    # Strip every tag: '<', the shortest possible run of characters, then '>'.
    p = re.compile(r'<.*?>')
    return p.sub('', data)

I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)

Edit: a version that removes only things of the form <img ... />:

def remove_img_tags(data):
    # Strip only self-closing image tags such as <img src="/path/to/file"/>.
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
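
If the tags might also span several lines, use uppercase, or omit the trailing slash, a slightly stricter pattern could look like the sketch below. It is not part of the snippet linked above. The names IMG_TAG and strip_img_tags are made up for illustration, and it assumes that no attribute value contains a literal '>'.

```python
import re

# Assumption: attribute values never contain a literal '>'.
# \b keeps the pattern from matching longer tag names that merely start with "img".
# [^>]* also matches newlines, so a tag may span several lines.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)


def strip_img_tags(text):
    """Return text with every <img ...> or <img .../> tag removed."""
    return IMG_TAG.sub('', text)
```

For the tag format in the question, strip_img_tags('before <img src="/path/to/file"/> after') returns 'before  after'. Compiling the pattern once at module level avoids rebuilding it on every call.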
joshcartme answered Oct 13 '22 12:10


Since this text contains only image tags, it's probably OK to use a regex. But for anything else you're probably better off using a bona fide HTML parser. Fortunately, Python provides one! The example below is pretty bare-bones; to be fully functional, it would have to handle many more corner cases. (Most notably, XHTML-style empty tags, which end with a slash as in <... />, aren't handled correctly here.)

>>> from HTMLParser import HTMLParser
>>> 
>>> class TagDropper(HTMLParser):
...     def __init__(self, tags_to_drop, *args, **kwargs):
...         HTMLParser.__init__(self, *args, **kwargs)
...         self._text = []
...         self._tags_to_drop = set(tags_to_drop)
...     def clear_text(self):
...         self._text = []
...     def get_text(self):
...         return ''.join(self._text)
...     def handle_starttag(self, tag, attrs):
...         if tag not in self._tags_to_drop:
...             self._text.append(self.get_starttag_text())
...     def handle_endtag(self, tag):
...         self._text.append('</{0}>'.format(tag))
...     def handle_data(self, data):
...         self._text.append(data)
... 
>>> td = TagDropper([])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an <img url="foo"> tag
Another line of text with a <br> tag

And to drop img tags...

>>> td = TagDropper(['img'])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print td.get_text()
A line of text
A line of text with an  tag
Another line of text with a <br> tag
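
The session above is Python 2. On Python 3 the parser lives in html.parser, and the XHTML-style <img .../> form from the question needs its own handler. The following is only a sketch of how the same idea might be adapted, not a drop-in replacement: the names Py3TagDropper and drop_tags are invented for illustration, and comments, doctype declarations and processing instructions are still silently discarded.

```python
from html.parser import HTMLParser


class Py3TagDropper(HTMLParser):
    """Re-emit the fed markup while omitting the tags named in tags_to_drop."""

    def __init__(self, tags_to_drop):
        # convert_charrefs=False keeps entities such as &amp; as written
        # instead of decoding them before handle_data sees them.
        super().__init__(convert_charrefs=False)
        self._text = []
        self._tags_to_drop = {tag.lower() for tag in tags_to_drop}

    def get_text(self):
        return ''.join(self._text)

    def handle_starttag(self, tag, attrs):
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <img src="..."/>.
        if tag not in self._tags_to_drop:
            self._text.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Skip closing tags of dropped elements as well.
        if tag not in self._tags_to_drop:
            self._text.append('</{0}>'.format(tag))

    def handle_data(self, data):
        self._text.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities (&amp;) unchanged.
        self._text.append('&{0};'.format(name))

    def handle_charref(self, name):
        # Re-emit numeric character references (&#38;) unchanged.
        self._text.append('&#{0};'.format(name))


def drop_tags(text, tags_to_drop):
    """Hypothetical convenience wrapper: parse text once and return the result."""
    parser = Py3TagDropper(tags_to_drop)
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return parser.get_text()
```

With this version, drop_tags('A line with an <img src="/path/to/file"/> tag', ['img']) returns 'A line with an  tag', while other tags such as <br> pass through untouched.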
senderle answered Oct 13 '22 11:10