
Removing Specific HTML Tags with CFML

I need a regex to remove all instances of <FONT>, including any attributes it might have, as in <FONT size=2 face=Verdana>, along with its closing tag </FONT>. In the string I get back, the font tag can contain any attribute with varying values, and the HTML structure is not consistent. This is one example of the string I receive:

<UL>
    <LI><FONT size=2 face=Verdana>random text<STRONG>random text</STRONG>random text<SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; FONT-SIZE: 11pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><SPAN style="mso-spacerun: yes">&nbsp;</SPAN>random text</SPAN> </FONT></LI>
    <LI><FONT size=2 face=Verdana><FONT size=2 face=Verdana><STRONG>random text</STRONG></FONT></LI> <LI>random text</FONT></LI>
    <LI><FONT size=2 face=Verdana>random text</FONT></LI>
    <LI><FONT size=2 face=Verdana>random text</FONT></LI>

And this is what I would like it to look like after applying the regex:

<UL>
    <LI>random text<STRONG>random text</STRONG>random text<SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; FONT-SIZE: 11pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><SPAN style="mso-spacerun: yes">&nbsp;</SPAN>random text</SPAN></LI>
    <LI><STRONG>random text</STRONG></LI>
    <LI>random text</LI>
    <LI>random text</LI>
    <LI>random text</LI>

I have tried different variations and have been able to remove the <FONT part, but not its attributes, the ending >, or the closing tag </FONT>.

This is an example of what I'm using:

loc.result = rereplace(arguments.htmlString, "\\<FONT[^*\\>", "", "ALL");

I apologize for my bad regex code. Any hints or suggestions would be greatly appreciated!

Phillip Pantaleano asked Jan 02 '23 08:01


1 Answer

As others have written before, don't use regex for that. Use an HTML parser like JSoup.

Download the JSoup jar file and save it somewhere on your classpath, and then use the following function (cfscript syntax, tested with Lucee, but should work with any CFML engine):

<cfscript>
/** Removes the given tag from the input HTML while keeping its contents. */
function removeTag(input, tagname){

    var Jsoup = createObject("java", "org.jsoup.Jsoup");
    var doc   = Jsoup.parse(arguments.input);
    // assumes the input has a single root element (e.g. the <UL>)
    var body  = doc.body().child(0);
    var tags  = body.select(arguments.tagname);

    // strip all attributes so every opening tag serializes as a bare <tagname>
    for (var tag in tags){
        for (var attr in tag.attributes().asList())
            tag.removeAttr(attr.getKey());
    }

    // jsoup serializes tag names in lowercase, so match case-insensitively
    var result = body.toString();
    result = replaceNoCase(result, "<#arguments.tagname#>",  "", "all");
    result = replaceNoCase(result, "</#arguments.tagname#>", "", "all");

    return result;
}
</cfscript>

Then just call the function with the HTML code that you want to clean, e.g.:

cleanHtml = removeTag(inputHtml, "font");
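
As a possible variation, jsoup's unwrap() method removes an element while keeping its children in place, which would avoid the string replacement step. The following is only a sketch under that assumption: the function name unwrapTag is hypothetical, it uses parseBodyFragment so the input does not need a single root element, and it has not been run against the example here.

```cfml
<cfscript>
/** Sketch: removes the given tag via jsoup's unwrap(), keeping the tag's contents. */
function unwrapTag(input, tagname){

    var Jsoup = createObject("java", "org.jsoup.Jsoup");
    // parse the input as a fragment placed inside an empty <body>
    var doc   = Jsoup.parseBodyFragment(arguments.input);

    // unwrap() removes each matched element and moves its children up to its parent
    for (var el in doc.body().select(arguments.tagname)){
        el.unwrap();
    }

    // return only the body's inner HTML, without the <html>/<body> wrapper
    return doc.body().html();
}
</cfscript>
```

It would be called the same way, e.g. cleanHtml = unwrapTag(inputHtml, "font"). Note that jsoup normalizes the markup it parses, so the output may differ from the input in tag case and whitespace.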

To test your example, I added the following:

<cfsavecontent variable="input">
<UL>
    <LI><FONT size=2 face=Verdana>random text 1<STRONG>random text 2</STRONG>random text 3<SPAN style="LINE-HEIGHT: 115%; FONT-FAMILY: 'Arial','sans-serif'; FONT-SIZE: 11pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA"><SPAN style="mso-spacerun: yes">&nbsp;</SPAN>random text 4</SPAN> </FONT></LI>
    <LI><FONT size=2 face=Verdana><FONT size=2 face=Verdana><STRONG>random text 5</STRONG></FONT></LI> <LI>random text 5</FONT></LI>
    <LI><FONT size=2 face=Verdana>random text 6</FONT></LI>
    <LI><FONT size=2 face=Verdana>random text 7</FONT></LI>
</cfsavecontent>

<cfdump var="#{ output: removeTag(input, "font"), input: input }#">

The output was shown in a screenshot of the cfdump result (image not reproduced here).

I also recommend reading my blog post "Harnessing the Power of Java in CFML".

isapir answered Jan 09 '23 21:01