<p>I am trying to take a string containing HTML, strip out some tags (img, object) entirely and, for all other HTML tags, strip out their attributes. For example:</p>
<pre class="prettyprint"><code>&lt;div id="someId" style="color: #000000"&gt;
   &lt;p class="someClass"&gt;Some Text&lt;/p&gt;
   &lt;img src="images/someimage.jpg" alt="" /&gt;
   &lt;a href="somelink.html"&gt;Some Link Text&lt;/a&gt;
&lt;/div&gt;
</code></pre>
<p>Would become:</p>
<pre class="prettyprint"><code>&lt;div&gt;
   &lt;p&gt;Some Text&lt;/p&gt;
   Some Link Text
&lt;/div&gt;
</code></pre>
<p>I am trying:</p>
<pre class="prettyprint"><code>string.replaceAll("&lt;\/?[img|object](\s\w+(\=\".*\")?)*\&gt;", ""); //REMOVE img/object
</code></pre>
<p>I am not sure how to strip all attributes inside a tag though.</p>
<p>Any help would be appreciated.</p>
<p>Thanks.</p>

<p>I would not recommend regex for this if you want to filter specific tags. It is going to be a hell of a job and will never be fully reliable. Use a normal HTML parser like Jsoup. It offers the <code>Whitelist</code> API (renamed to <code>Safelist</code> as of jsoup 1.14.2) to clean up HTML. See also this cookbook document.</p>
<p>Here's a kickoff example with Jsoup that allows <code>&lt;div&gt;</code> and <code>&lt;p&gt;</code> tags in addition to the standard set of tags of the chosen <code>Whitelist</code>, which is <code>Whitelist#simpleText()</code> in the example below.</p>
<pre class="prettyprint lang-java prettyprint-override"><code>import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String html = "&lt;div id='someId' style='color: #000000'&gt;&lt;p class='someClass'&gt;Some Text&lt;/p&gt;&lt;img src='images/someimage.jpg' alt='' /&gt;&lt;a href='somelink.html'&gt;Some Link Text&lt;/a&gt;&lt;/div&gt;";
// Whitelist.simpleText() allows b, em, i, strong, u. Use Whitelist.none() instead if you want to start clean.
Whitelist whitelist = Whitelist.simpleText();
whitelist.addTags("div", "p");
String clean = Jsoup.clean(html, whitelist);
System.out.println(clean);
</code></pre>
<p>This results in</p>
<pre class="prettyprint lang-html prettyprint-override"><code>&lt;div&gt;
   &lt;p&gt;Some Text&lt;/p&gt;Some Link Text
&lt;/div&gt;
</code></pre>
<h3>See also:</h3>
<ul>
<li>How to implement a possibility for user to post some html-formatted data in a safe way?</li>
</ul>

<p>You can remove all attributes like this:</p>
<pre class="prettyprint"><code>string.replaceAll("(&lt;\\w+)[^&gt;]*(&gt;)", "$1$2");
</code></pre>
<p>This expression matches an opening tag, but captures only its header <code>&lt;div</code> and the closing <code>&gt;</code> as groups 1 and 2. <code>replaceAll</code> uses <em>references</em> to these groups to join them back in the output as <code>$1$2</code>. This cuts out the attributes in the middle of the tag.</p>
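<p>For completeness, here is a minimal, self-contained sketch that combines the expression above with a second pattern that drops <code>img</code> and <code>object</code> tags first. The class name <code>AttributeStripper</code>, its method names and the tag-removal pattern are illustrative assumptions, not part of the original answer. Like any regex approach, it assumes simple, well-formed markup with no <code>&gt;</code> inside attribute values, and it will also rewrite tag-like text inside comments or scripts.</p>

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class AttributeStripper {
    // Matches opening, closing and self-closing <img> and <object> tags.
    // (Assumed pattern; the word boundary keeps names like <imgx> from matching.)
    private static final Pattern DROPPED_TAGS =
            Pattern.compile("</?(?:img|object)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // The expression from the answer: group 1 is "<tagname", group 2 is ">".
    private static final Pattern OPENING_TAG = Pattern.compile("(<\\w+)[^>]*(>)");

    // Removes the img/object tags themselves; any text between
    // <object> and </object> is left in place.
    public static String removeTags(String html) {
        return DROPPED_TAGS.matcher(html).replaceAll("");
    }

    // Rejoins "<tagname" and ">" and discards everything in between.
    public static String stripAttributes(String html) {
        return OPENING_TAG.matcher(html).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" />"
                + "<a href=\"somelink.html\">Some Link Text</a></div>";
        // prints: <div><p>Some Text</p><a>Some Link Text</a></div>
        System.out.println(stripAttributes(removeTags(html)));
    }
}
```

<p>Note that the <code>a</code> element survives without its attributes. Unwrapping it so that only the link text remains, as in the question's expected output, would need a further replacement or a parser-based approach such as the Jsoup one.</p>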

How would I remove all HTML attributes in HTML tags in a string

Tags:

java

regex

html-parsing



asked Feb 23 '12 15:02

fanfavorite



BalusC

Sergey Kalinichenko
