Stripping HTML tags in PostgreSQL

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling, but they were stripping the text between the tags too!

samach asked Aug 21 '12 07:08


2 Answers

select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;
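
To see what this pattern does without needing the message table, here is a minimal sketch run against a literal string (the sample HTML is made up for illustration). It removes anything that looks like a tag and keeps the text between tags. Note that it does not decode entities such as &amp;, and a ">" inside an attribute value would end the match early.

```sql
-- Hypothetical sample input; the pattern is the same one used in the query above.
SELECT regexp_replace(
         '<p>Hello <b>world</b> &amp; friends</p>',
         '<[^>]+>',   -- "<", one or more characters that are not ">", then ">"
         '',          -- replace each tag with nothing
         'g'          -- replace every match, not just the first
       ) AS stripped;
-- stripped: Hello world &amp; friends
```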
acohen answered Sep 21 '22 20:09


Use xpath

Store the content in a column of the XML datatype rather than "second-class" TEXT: converting HTML to XHTML is very simple (see HTML Tidy, or the standard DOM loadHTML() and saveXML() methods).

It is fast and very safe, because the markup is handled by an XML parser rather than by pattern matching.

The common information-retrieval need is not the full content but something inside the XHTML, so the power of xpath is welcome.

Example: retrieve all paragraphs with class="fn":

  WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] AS frags
    FROM t
  ) SELECT array_to_string(frags, ' ') AS my_p_fn2txt
    FROM needinfo
    WHERE array_length(frags, 1) > 0;
  -- for full content use xpath('//text()', xhtml)
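
The query above can be tried without a real table. This is a minimal sketch, assuming a PostgreSQL build with XML support and a made-up XHTML fragment standing in for the xhtml column of t. One caveat: if the stored XHTML declares the default namespace (xmlns="http://www.w3.org/1999/xhtml"), element names such as p only match when a namespace mapping is passed as the optional third argument of xpath().

```sql
-- Hypothetical one-row stand-in for table t; the fragment must be well-formed XML.
WITH t(xhtml) AS (
  VALUES ('<div><p class="fn">Jane <b>Doe</b></p><p>Other text</p></div>'::xml)
)
SELECT
  -- text nodes inside <p class="fn"> only
  array_to_string(xpath('//p[@class="fn"]//text()', xhtml)::text[], ' ') AS fn_text,
  -- every text node in the document, i.e. the content with all tags stripped
  array_to_string(xpath('//text()', xhtml)::text[], ' ') AS full_text
FROM t;
```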

regex solutions...

I do not recommend them, because this is not an "information retrieval" solution and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.

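 CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
     -- Inner call: replace the first tag that has an alt attribute with the alt text.
     -- Outer call: remove every remaining tag ('g' = all matches).
     -- Backslashes are doubled because E'' strings treat "\" as an escape character.
     SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
       E'(?x)(< [^>]*? >)', '', 'g')
 $$ LANGUAGE SQL;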
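
The two-step idea behind this function can also be tried inline. The sketch below assumes standard_conforming_strings is on (the default in current PostgreSQL releases), so ordinary string literals are used and the backslashes stay single. The sample HTML is made up. The inner call is meant to swap a tag carrying an alt attribute for its alt text, and the outer call then removes whatever tags remain.

```sql
-- Hypothetical input; the patterns mirror the strip_tags body, written as ordinary literals.
SELECT regexp_replace(
         regexp_replace(
           '<p>Logo: <img src="a.png" alt="ACME"> and <b>text</b></p>',
           -- (?x) = expanded syntax: whitespace in the pattern is ignored
           -- group 2 = opening quote, group 3 = alt value, \2 = matching closing quote
           '(?x)<[^>]*?(\s alt \s* = \s* ([''"]) ([^>]*?) \2) [^>]*? >',
           '\3'),          -- no 'g' flag: only the first such tag is replaced
         '(?x)(< [^>]*? >)',
         '',
         'g') AS stripped; -- 'g': strip every remaining tag
```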

See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stack Overflow.

6 revs, 2 users 79% answered Sep 25 '22 20:09