I am trying to take a string that contains HTML, strip out some tags entirely (img, object), and strip the attributes from all other HTML tags. For example:
<div id="someId" style="color: #000000">
<p class="someClass">Some Text</p>
<img src="images/someimage.jpg" alt="" />
<a href="somelink.html">Some Link Text</a>
</div>
Would become:
<div>
<p>Some Text</p>
Some Link Text
</div>
I am trying:
string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object
I am not sure how to strip all attributes inside a tag though.
Any help would be appreciated.
Thanks.
In JavaScript there are several ways to strip all HTML tags from a string: the replace() function with a regular expression, or the .textContent and .innerText properties of a DOM element.
In PHP, the strip_tags() function strips HTML, XML, and PHP tags from a string.
Approach: select the HTML element that needs to be removed, then use the JavaScript remove() or removeChild() method to remove it from the HTML document.
To remove all attributes of an element in JavaScript, loop over its attributes and call removeAttribute() or removeAttributeNode() for each one; a single call removes only one attribute.
I would not recommend regex for this if you want to filter specific tags. It is going to be a hell of a job and will never be fully reliable. Use a normal HTML parser like Jsoup. It offers the Whitelist API (renamed Safelist in newer Jsoup releases) to clean up HTML. See also this cookbook document.
Here's a kickoff example with Jsoup which allows only <div> and <p> tags in addition to the standard set of tags of the chosen Whitelist, which is Whitelist#simpleText() in the example below.
String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";
Whitelist whitelist = Whitelist.simpleText(); // Whitelist.simpleText() allows b, em, i, strong, u. Use Whitelist.none() instead if you want to start clean.
whitelist.addTags("div", "p");
String clean = Jsoup.clean(html, whitelist);
System.out.println(clean);
This results in
<div>
<p>Some Text</p>Some Link Text
</div>
You can remove all attributes like this:
string.replaceAll("(<\\w+)[^>]*(>)", "$1$2");
This expression matches an opening tag, but captures only its header <div and the closing > as groups 1 and 2. replaceAll uses references to these groups to join them back in the output as $1$2. This cuts out the attributes in the middle of the tag.
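For completeness, here is a minimal sketch that combines both steps using only java.util.regex. The class and method names (HtmlTagCleaner, removeTags, stripAttributes) are made up for illustration. Note that the question's pattern uses [img|object], which is a character class matching a single character; a group such as (?:img|object) is needed to match whole tag names. The sketch assumes no attribute value contains a > character, and it leaves the <a> tag in place, stripped of its href. Add a to the alternation if you want that tag removed as well. For untrusted or messy HTML a parser is still the safer route.

```java
import java.util.regex.Pattern;

public class HtmlTagCleaner {
    // Opening, closing, or self-closing <img> / <object> tags, attributes included.
    private static final Pattern REMOVED_TAGS =
            Pattern.compile("</?(?:img|object)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // An opening tag: capture the tag name, match everything else up to '>'.
    // Assumption: attribute values never contain a literal '>'.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<(\\w+)\\b[^>]*>");

    // Deletes the img/object tags themselves (text between them is kept).
    public static String removeTags(String html) {
        return REMOVED_TAGS.matcher(html).replaceAll("");
    }

    // Rewrites every opening tag as just <name>, dropping its attributes.
    public static String stripAttributes(String html) {
        return OPENING_TAG.matcher(html).replaceAll("<$1>");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" />"
                + "<a href=\"somelink.html\">Some Link Text</a>"
                + "</div>";
        String cleaned = stripAttributes(removeTags(html));
        // Prints: <div><p>Some Text</p><a>Some Link Text</a></div>
        System.out.println(cleaned);
    }
}
```

Removing the unwanted tags first keeps the second pattern simple, since it then only has to deal with tags that should survive. A side effect worth knowing: self-closing tags such as <br /> come out as <br>.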