How to decode HTML codes using Java? [duplicate]

Q: How do I decode a string in HTML?

Common HTML entities include &lt; for the < character, &quot; for the double-quote character and &gt; for the > character. Decoding (unescaping) replaces each entity with the character it stands for. Even when every character of an input string is written as an entity, as in obfuscated HTML, the decoded output is plain readable text.

Tags: java, html, regex, decode

Possible Duplicate:
Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

I need to extract paragraphs (like the title on Stack Overflow) from an HTML file.

I can use regular expressions in Java to extract the fields I need, but I then have to decode the extracted fields.

EXAMPLE

field extracted:

Paging Lucene&#39;s search results

field after decoding:

Paging Lucene's search results

Is there any class in Java that will allow me to convert these HTML codes?

254

asked Dec 06 '12 18:12

user

2 Answers

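Use the methods provided by Apache Commons Lang. The import below is for Commons Lang 2.x; in Commons Lang 3 the equivalent method is StringEscapeUtils.unescapeHtml4 in org.apache.commons.lang3, and newer projects get it from Apache Commons Text (org.apache.commons.text.StringEscapeUtils).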

import org.apache.commons.lang.StringEscapeUtils;
// ...
String afterDecoding = StringEscapeUtils.unescapeHtml(beforeDecoding);

168

answered Sep 21 '22 13:09

jlordo

Do not try to solve everything by regexp.

While you can do some parts this way, such as replacing entities, the much better approach is to actually use a (robust) HTML parser.

See the question "RegEx match open tags except XHTML self-contained tags" for why this is a bad job for the regexp Swiss Army chainsaw. Seriously, read that question and its top answer; it is a Stack Overflow highlight!

Chuck Norris can parse HTML with regex.

The bad news is: there is more than one way to encode characters.

https://en.wikipedia.org/wiki/Character_encodings_in_HTML

For example, the character 'λ' can be represented as &lambda;, &#955; or &#X03bb;

And if you are really unlucky, some web site relies on the ability of some browsers to guess what a character was meant to be. &#153; for example is not valid, yet many browsers will interpret it as ™.
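To make the variety concrete, here is a minimal sketch, using only the JDK, that decodes just numeric character references in their decimal and hexadecimal forms. The class name NumericEntityDecoder and its decode method are made up for this illustration. Named entities and browser-style guessing of invalid references are deliberately not handled, and that gap is what a dedicated library covers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a replacement for a real library.
final class NumericEntityDecoder {

    // Group 1: hexadecimal digits (&#x3bb; or &#X3BB;), group 2: decimal digits (&#955;).
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    public static String decode(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous reference and this one unchanged.
            out.append(input, last, m.start());
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (Character.isValidCodePoint(codePoint)) {
                out.appendCodePoint(codePoint);
            } else {
                // Out-of-range number: keep the reference as written.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Paging Lucene's search results
        System.out.println(decode("Paging Lucene&#39;s search results"));
    }
}
```

Anything the pattern does not recognize, such as a named entity, passes through unchanged, so the sketch silently under-decodes real-world HTML.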

Clearly it is a good idea to leave this to a dedicated library instead of trying to hack a custom regular expression yourself.

So I strongly recommend:

1. Feed the string into a robust HTML parser
2. Get the parsed (and fully decoded) string back
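These two steps can be sketched with the HTML parser that ships with the JDK (javax.swing.text.html), chosen here only so the example runs without extra dependencies. It is a dated parser and merely a stand-in, as an assumption of this sketch, for whichever robust parsing library you actually adopt. The class name HtmlTextExtractor is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only.
final class HtmlTextExtractor {

    /** Parses the markup and returns its text content with entities decoded. */
    public static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // The parser hands over text runs with entities already resolved.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        try {
            // Step 1: feed the string into the parser (true = ignore charset directives).
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Step 2: get the decoded text back.
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText(
                "<html><body><p>Paging Lucene&#39;s search results</p></body></html>"));
    }
}
```

The point of the design is that tag handling and entity decoding happen in one pass inside the parser, so no regular expression ever touches the markup.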

answered Sep 22 '22 13:09

Has QUIT--Anony-Mousse
