How would I remove all HTML attributes in HTML tags in a string

I am trying to take a string that contains HTML, remove some tags entirely (img, object), and strip the attributes from all other HTML tags. For example:

<div id="someId" style="color: #000000">
   <p class="someClass">Some Text</p>
   <img src="images/someimage.jpg" alt="" />
   <a href="somelink.html">Some Link Text</a>
</div>

Would become:

<div>
   <p>Some Text</p>
   Some Link Text
</div>

I am trying:

string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object

I am not sure how to strip all attributes inside a tag though.

Any help would be appreciated.

Thanks.

fanfavorite asked Feb 23 '12 15:02



2 Answers

I would not recommend regex for this if you want to filter specific tags. It is going to be a hell of a job and will never be fully reliable. Use a real HTML parser such as Jsoup, which offers the Whitelist API for cleaning up HTML. See also the Jsoup cookbook document on this.

Here's a kickoff example using Jsoup. It allows only the <div> and <p> tags in addition to the standard set of tags of the chosen Whitelist, which is Whitelist#simpleText() in the example below.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases name this class Safelist

String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";
// Whitelist.simpleText() allows b, em, i, strong, u. Use Whitelist.none() instead if you want to start clean.
Whitelist whitelist = Whitelist.simpleText();
whitelist.addTags("div", "p"); // additionally allow div and p, without any attributes
String clean = Jsoup.clean(html, whitelist);
System.out.println(clean);

This results in

<div>
   <p>Some Text</p>Some Link Text
</div>

See also:

  • How to implement a possibility for user to post some html-formatted data in a safe way?
BalusC answered Sep 24 '22 22:09


You can remove all attributes like this:

string = string.replaceAll("(<\\w+)[^>]*(>)", "$1$2"); // Strings are immutable, so keep the returned value

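This expression matches an opening tag, but captures only its beginning (for example <div) and the closing > as groups 1 and 2. replaceAll uses references to these groups to join them back together in the output as $1$2, which cuts out the attributes in the middle of the tag. Closing tags such as </div> are not matched, because \w+ requires a word character directly after the <.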
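To make the behavior concrete, here is a minimal, self-contained sketch that applies the expression above to the sample HTML from the question. The class name AttributeStripper, its two helper methods, and the second pattern for removing img/object tags are hypothetical names and choices made for illustration only; they are not part of either answer.

```java
import java.util.regex.Pattern;

class AttributeStripper {

    // Expression from the answer: group 1 is "<tagname", group 2 is the closing ">".
    private static final Pattern OPENING_TAG = Pattern.compile("(<\\w+)[^>]*(>)");

    // Assumed companion pattern (not from the answer): any opening, closing,
    // or self-closing img/object tag, including its attributes.
    private static final Pattern UNWANTED_TAG = Pattern.compile("</?(?:img|object)\\b[^>]*>");

    /** Drops everything between the tag name and the closing '>' of opening tags. */
    static String stripAttributes(String html) {
        return OPENING_TAG.matcher(html).replaceAll("$1$2");
    }

    /** Removes the img/object tags themselves; text between such tags is kept. */
    static String removeUnwantedTags(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" />"
                + "<a href=\"somelink.html\">Some Link Text</a></div>";

        // Remove the unwanted tags first, then strip attributes from what is left.
        System.out.println(stripAttributes(removeUnwantedTags(html)));
        // prints <div><p>Some Text</p><a>Some Link Text</a></div>
    }
}
```

Note that this sketch keeps the <a> tag (without its href), whereas the desired output in the question drops it. It also shares the usual limits of regex-based HTML handling: a self-closing tag such as <br /> becomes <br>, and an attribute value that contains a > character will be cut in the wrong place. For untrusted input an HTML parser is the safer route.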

Sergey Kalinichenko answered Sep 25 '22 22:09