Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What would I use to remove escaped html from large sets of data

Tags:

mysql

perl

Our database is filled with articles retrieved from RSS feeds. I was unsure of what data I would be getting, and how much filtering was already setup (WP-O-Matic Wordpress plugin using the SimplePie library). This plugin does some basic encoding before insertion using Wordpress's built in post insert function which also does some filtering. Between the RSS feed's encoding, the plugin's encoding using PHP, Wordpress's encoding and SQL escaping, I'm not sure where to start.

The data is usually at the end of the field after the content I want to keep. It is all on one line, but separated out for readability:

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:V_sGLiPBpWU" border="0"></img>

<img src="http://feeds.feedburner.com/~ff/SoundOnTheSound?d=qj6IDK7rITs" border="0"></img>

&lt;img src=&quot;http://feeds.feedburner.com/~ff/SoundOnTheSound?i=xFxEpT2Add0:xFbIkwGc-fk:D7DqB2pKExk&quot;

Notice how some of the images are escape and some aren't. I believe this has to do with the last part being cut off so as to be unrecognizable as an html tag, which then caused it to be html endcoded while the actual img tags were left alone.

Another record has only this in one of the fields, which means the RSS feed gave me nothing for the item (filtered out now, but I have a bunch of records like this):

&lt;img src=&quot;http://farm3.static.flickr.com/2183/2289902369_1d95bcdb85.jpg&quot; alt=&quot;post_img&quot; width=&quot;80&quot;

All extracted samples are on one line, but broken up for readability. Otherwise, they are copied exactly from the database from the command line mysql client.

Question: What is the best way to work with the above escaped html (or portion of an html tag), so I can then remove it without affecting the content?

I want to remove it, because the images at the end of the field are usually images that have nothing to do with content. In the case of the feedburner ones, feedburner adds those to every single article in a feed. Other times, they're broken links surrounding broken images. The point is not the valid html img tags which can be removed easily. It's the mangled tags which if unencoded will not be valid html, which will not be parsable with your standard html parsers.

[EDIT] If it was just a matter of pulling the html I wanted out and doing a strip_tags and reinserting the data, I wouldn't be asking this question.

The portion that I have a problem with is that what used to be an img tag was html encoded and the end cut off. If it's deencoded it will not be an html tag, so I cannot parse it the usual way.

With all the &lt;img src=&quot; crap, I can't get my head around searching for it other than SELECT ID, post_content FROM table WHERE post_content LIKE '&lt;img' which at least gets me those posts. But when I get the data, I need a way to find it, remove it, but keep the rest of the content.

[/EDIT]

[EDIT 2]

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.<div> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:yIl2AUoC8zA"><img src="http://feeds.feedburner.com/~ff/bizj_portland?d=yIl2AUoC8zA" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:V_sGLiPBpWU" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo"><img src="http://feeds.feedburner.com/~ff/bizj_portland?i=YIs66yw13JE:_zirAnH6dt8:F7zBnMyn0Lo" border="0"></img></a> <a href="http://feeds.bizjournals.com/~ff/bizj_portland?a=YIs66yw13JE:_zirAnH6dt8:qj6IDK7rITs">&lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot;

The part I want to keep:

<img src="http://farm4.static.flickr.com/3162/2735565872_b8a4e4bd17.jpg" alt="post_img" width="80" />Through the first two months of the year, the volume of cargo handled at Port of Portland terminals has increased 46 percent as the port?s marine cargo business shows signs of recovering from a dismal 2009.

To reiterate: It's not about removing the valid html img tags. That's easy. I need to be able to find specifically the &lt;img src=&quot;http://feeds.feedburner.com/~ff/bizj_portland?d=qj6IDK7rITs&quot; if it's part of the pattern of img tag img tag mangled img tag or anchor img anchor img img mangled image etc etc, but not remove &lt;img if it is indeed part of the article. Out of the few dozen samples I've reviewed, it's been pretty consistent that this mangled img tag is at the end of the field.

The other one is the single mangled image tag. It's consistently a mangled flickr img tag, but as above, I can't just search for &lt;img as it could be a valid part of the content.

The problem lies in that I can't simply decode it and parse it as HTML, because it will not be valid html. [/EDIT 2]

like image 990
Elizabeth Buckwalter Avatar asked Apr 13 '10 17:04

Elizabeth Buckwalter


1 Answers

The best way is to:

  1. Install HTML::Entities from CPAN and use that to unescape the URIs.
  2. Install HTML::Parser from CPAN and use that to parse and remove the URIs after they're unescaped.

Regexes are not a suitable tool for this task.

like image 145
Dave Sherohman Avatar answered Nov 15 '22 08:11

Dave Sherohman