
How to parse quasi-HTML text in Java?

Tags: java, parsing

The quasi-HTML text looks like this: Simple<br> text <b>simple</b> text simple <BR><BR>text simple text. I would like to parse it and create a DOM document. The problem is the unclosed tags. When I try this:

DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource source = new InputSource(new StringReader(string)); // string holds the quasi-HTML text
Document doc = builder.parse(source);

This error occurs: org.xml.sax.SAXParseException; The element type "br" must be terminated by the matching end-tag

I don't want to replace every <br> with <br></br>. Is there any solution or advice?

tostao asked Aug 01 '13 08:08



2 Answers

Use jsoup and enjoy the ease of use: it parses HTML leniently, so unclosed tags such as <br> are not a problem.

Michael-O answered Nov 11 '22 05:11



You must rewrite the text as well-formed HTML. Basically, you go through the text and keep a list of all opening tags. When you find a corresponding closing tag, you remove the opening tag from the list. If the list still has entries when you reach the end, you know the text is not well formed.
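
The bookkeeping described above could be sketched roughly like this. It is only a minimal illustration: the class name UnclosedTagFinder, the method findUnclosedTags and the regular expression are made up for this example, a Deque is used in place of a plain List, and the regex assumes simple tags without ">" inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class UnclosedTagFinder {
    // group 1 = "/" for a closing tag, group 2 = tag name,
    // group 3 = "/" for a self-closing tag such as <br />.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    /** Returns the names of tags that were opened but never closed, in document order. */
    static List<String> findUnclosedTags(String text) {
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            // Tag names are compared case-insensitively, so <BR> counts as "br".
            String name = m.group(2).toLowerCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue; // <br /> opens and closes itself
            }
            if (closing) {
                // Remove the most recently opened tag with this name, if any.
                open.removeFirstOccurrence(name);
            } else {
                open.push(name);
            }
        }
        // The deque iterates from most recent to oldest; reverse to document order.
        List<String> unclosed = new ArrayList<>(open);
        Collections.reverse(unclosed);
        return unclosed;
    }
}
```

For the text from the question, findUnclosedTags returns [br, br, br], which tells you that the three br tags are the ones that need fixing.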

The problem is where to insert the missing closing tags. You can try to insert the corresponding closing tag right after the next word. In your case you can simply replace each <br> tag with a <br /> tag, if that is the only kind of unclosed tag that occurs. Here, string holds the document's content:

// (?i) makes the match case-insensitive, so <BR> is replaced as well
string = string.replaceAll("(?i)<br\\s*>", "<br />");
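
Putting the replacement together with the DocumentBuilder code from the question might look like the sketch below. Two things here are assumptions of this example rather than part of the answer: the class name QuasiHtmlParser, and the wrapping element named root, which is added because an XML document needs exactly one root element and the sample text has none.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
class QuasiHtmlParser {
    /** Self-closes <br> tags, wraps the text in a root element and parses it as XML. */
    static Document parse(String string) {
        // Case-insensitive, so both <br> and <BR> become <br/>.
        String fixed = string.replaceAll("(?i)<br\\s*>", "<br/>");
        // "root" is an arbitrary element name chosen for this sketch.
        String wrapped = "<root>" + fixed + "</root>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(wrapped)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Any other unclosed tag or HTML-only entity ends up here.
            throw new IllegalArgumentException("Text is still not well-formed XML", e);
        }
    }
}
```

With the text from the question, QuasiHtmlParser.parse(string) gives a document whose root element contains three br elements and one b element. Other unclosed tags, or HTML entities such as &nbsp;, would still make the XML parser fail.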
Stimpson Cat answered Nov 11 '22 04:11
