I've seen a number of questions about removing HTML tags from strings, but I'm still a bit unclear on how my specific case should be handled.
I've seen that many posts advise against using regular expressions to handle HTML, but I suspect my case may warrant judicious circumvention of this rule.
I'm trying to parse PDF files and I've successfully managed to convert each page from my sample PDF file into a string of UTF-32 text. When images appear, an HTML-style tag is inserted which contains the name and location of the image (which is saved elsewhere).
In a separate portion of my app, I need to get rid of these image tags. Because we're only dealing with image tags, I suspect the use of a regex may be warranted.
My question is twofold:
For clarity, the tags are structured as <img src="/path/to/file"/>
Thanks!
The strip_tags() function strips HTML, XML, and PHP tags from a string. Note: HTML comments are always stripped.
PHP provides a built-in function to remove HTML tags from data. The strip_tags() function is a built-in PHP function that strips HTML, XML, and PHP tags from a string. It accepts two parameters: the input string and an optional list of allowed tags. It returns the given $str with all NULL bytes, HTML tags, and PHP tags stripped.
There are several ways to strip all HTML tags from a string in JavaScript. You can use the replace() function with a regular expression, or read the .textContent or .innerText property of an HTML DOM element.
In Java, HTML tags can be removed from a string with the replaceAll() method of the String class, which takes a regular expression. The result is the original string as plain text, with the tags removed.
I would vote that in your case it is acceptable to use a regular expression. Something like this should work:
import re

def remove_html_tags(data):
    # Non-greedy match of anything between < and >
    p = re.compile(r'<.*?>')
    return p.sub('', data)
I found that snippet here (http://love-python.blogspot.com/2008/07/strip-html-tags-using-python.html)
Edit: a version which will only remove things of the form <img .... />:
import re

def remove_img_tags(data):
    # Only match self-closing <img ... /> tags
    p = re.compile(r'<img.*?/>')
    return p.sub('', data)
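As a quick check of the img-only approach, here is a minimal sketch run against a made-up page string. The sample text and the name `IMG_TAG` are assumptions for illustration only. The pattern is also a slightly stricter variant that I'm assuming is acceptable for your input: `\b` avoids matching other tag names that merely start with "img", and `[^>]*` keeps the match inside a single tag even when the tag contains a line break.

```python
import re

# Hypothetical stricter variant of the img pattern:
#   \b     -> the tag name must be exactly "img"
#   [^>]*  -> stay inside one tag, even across a line break
IMG_TAG = re.compile(r'<img\b[^>]*/>', re.IGNORECASE)

def remove_img_tags(data):
    """Return data with every self-closing <img ... /> tag removed."""
    return IMG_TAG.sub('', data)

# Made-up sample page text, for illustration only.
page = 'Figure 1 <img src="/path/to/file"/> shows the result.\n<b>Bold</b> text stays.'
cleaned = remove_img_tags(page)
print(cleaned)
```

Note that removing a tag that sits between two spaces leaves both spaces behind, so you may want to normalise whitespace afterwards.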
Since this text contains only image tags, it's probably OK to use a regex. But for anything else you're probably better off using a bona fide HTML parser. Fortunately Python provides one! This is pretty bare-bones; to be fully functional, it would have to handle a lot more corner cases. (Most notably, XHTML-style empty tags, ending with a slash as in <... />, aren't handled correctly here.)
>>> from html.parser import HTMLParser
>>>
>>> class TagDropper(HTMLParser):
...     def __init__(self, tags_to_drop, *args, **kwargs):
...         HTMLParser.__init__(self, *args, **kwargs)
...         self._text = []
...         self._tags_to_drop = set(tags_to_drop)
...     def clear_text(self):
...         self._text = []
...     def get_text(self):
...         return ''.join(self._text)
...     def handle_starttag(self, tag, attrs):
...         if tag not in self._tags_to_drop:
...             self._text.append(self.get_starttag_text())
...     def handle_endtag(self, tag):
...         if tag not in self._tags_to_drop:
...             self._text.append('</{0}>'.format(tag))
...     def handle_data(self, data):
...         self._text.append(data)
...
>>> td = TagDropper([])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print(td.get_text())
A line of text
A line of text with an <img url="foo"> tag
Another line of text with a <br> tag
And to drop img tags...
>>> td = TagDropper(['img'])
>>> td.feed('A line of text\nA line of text with an <img url="foo"> tag\nAnother line of text with a <br> tag\n')
>>> print(td.get_text())
A line of text
A line of text with an  tag
Another line of text with a <br> tag
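Since the tags in the question are XHTML-style empty tags (<img src="/path/to/file"/>), the corner case mentioned above matters here. One way to cover it is to override handle_startendtag as well. The following is only a sketch for Python 3 under my own assumptions: the names SelfClosingTagDropper and drop_tags are made up for this example, and it has not been tried on real PDF output.

```python
from html.parser import HTMLParser

class SelfClosingTagDropper(HTMLParser):
    """Rebuilds the input text while omitting selected tags (hypothetical name)."""

    def __init__(self, tags_to_drop):
        # convert_charrefs=False so entities are reported to the handlers
        # below instead of being decoded into plain characters.
        super().__init__(convert_charrefs=False)
        self._drop = set(tags_to_drop)
        self._out = []

    def get_text(self):
        return ''.join(self._out)

    def handle_starttag(self, tag, attrs):
        if tag not in self._drop:
            self._out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Called for XHTML-style empty tags such as <img src="..."/>
        if tag not in self._drop:
            self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self._drop:
            self._out.append('</{0}>'.format(tag))

    def handle_data(self, data):
        self._out.append(data)

    def handle_entityref(self, name):
        # Re-emit named entities such as &amp;
        self._out.append('&{0};'.format(name))

    def handle_charref(self, name):
        # Re-emit numeric entities such as &#169;
        self._out.append('&#{0};'.format(name))

def drop_tags(text, tags_to_drop):
    """Convenience wrapper (hypothetical): parse text, return it minus the tags."""
    parser = SelfClosingTagDropper(tags_to_drop)
    parser.feed(text)
    parser.close()  # flush anything the parser is still buffering
    return parser.get_text()

print(drop_tags('Text with an <img src="/path/to/file"/> tag and a <br/> tag', ['img']))
```

Tag names are lower-cased by the parser before the handlers see them, so the names passed in tags_to_drop should be lower case too.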