I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
So I started writing a bunch of queries that look like;
UPDATE nodes SET name=regexp_replace(name, 'à', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, 'á', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, 'â', 'â', 'g') WHERE name LIKE '%#xe2%';
Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/
, then passing just the %1
part to the ascii decoder, and reconstructing the string...or something...
Shall I just press on with the queries? There will probably only be 40 or so of them.
Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities
Of course you need to have perl installed and pl/perl available.
1) First of all create the procedural language pl/perlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&.... asasdasdasd …');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)
You could use xpath (HTML-encoded content is the same as XML encoded content):
select
'AT&T' as input ,
(xpath('/z/text()', ('<z>' || 'AT&T' || '</z>')::xml))[1] as output
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With