 

Replace HTML codes with equivalent characters in Java [duplicate]

I'm currently converting HTML character codes to their equivalent characters in Java. I need to convert the codes below to characters:

&#xE8; - è
&#xAE; - ®
&#x26; - &
&#xF1; - ñ
&amp; - &

I tried using the regex pattern

(&#x)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)([\\d|\\w]*)(;)

When I debug, matcher.find() returns true, but control skips the loop where I wrote the conversion code. I don't know what is happening there.

Also, is there any way to optimize this regex?

Any help is appreciated.

Exception

java.lang.NumberFormatException: For input string: "x26"
      at java.lang.NumberFormatException.forInputString(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at java.lang.Integer.parseInt(Unknown Source)
      at org.apache.commons.lang.Entities.unescape(Entities.java:683)
      at org.apache.commons.lang.StringEscapeUtils.unescapeHtml(StringEscapeUtils.java:483)
Raja Asthana asked Feb 21 '13 09:02




2 Answers

Also, is there any way to optimize this regex?

Yes: don't use regex for this task. Use StringEscapeUtils from Apache Commons Lang:

import org.apache.commons.lang.StringEscapeUtils;
...
String withCharacters = StringEscapeUtils.unescapeHtml(yourString);

JavaDoc says:

Unescapes a string containing entity escapes to a string containing the actual Unicode characters corresponding to the escapes. Supports HTML 4.0 entities.

For example, the string "&lt;Fran&ccedil;ais&gt;" will become "<Français>"

If an entity is unrecognized, it is left alone, and inserted verbatim into the result string. e.g. "&gt;&zzzz;x" will become ">&zzzz;x".
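
If adding a library is not an option, numeric references such as &#xE8; can also be decoded with the JDK alone. The stack trace in the question shows Integer.parseInt being handed "x26", which fails because parseInt without a radix expects decimal digits; the hex digits have to be parsed with radix 16. What follows is only a minimal sketch under that assumption, not what StringEscapeUtils does internally. The class name NumericEntityDecoder and the method decode are made up for illustration, and named entities such as &amp; are deliberately left untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library mentioned in this thread.
final class NumericEntityDecoder {

    // Group 1: hex digits after "&#x"; group 2: decimal digits after "&#".
    // The digit limits keep every match within int range.
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        // StringBuffer keeps this compatible with Java versions before 9.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse hex with radix 16; passing "x26" to parseInt would throw.
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            // Leave references outside the Unicode range exactly as written.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Numeric references are decoded; the named entity &amp; is not.
        System.out.println(decode("caf&#xE9; &#174; &#x26; &amp;"));
    }
}
```

A single capture group per number form replaces the four repeated ([\\d|\\w]*) groups from the question. For anything beyond numeric references, a maintained library is still the safer choice.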

jlordo answered Sep 20 '22 12:09


Another existing utility method is spring-web's org.springframework.web.util.HtmlUtils.htmlUnescape.

Example usage in a self-contained Groovy script:

@Grapes(
    @Grab(group='org.springframework', module='spring-web', version='4.3.0.RELEASE')
)
import org.springframework.web.util.HtmlUtils

println HtmlUtils.htmlUnescape("La &#xE9;lite del tenis no teme al zika y jugar&#xE1; en R&#xED;o")
Michal M answered Sep 18 '22 12:09