How to parse quasi-HTML text in Java?

Question

The quasi-HTML text looks like: Simple text simple text simple text simple text (the pieces are separated by unclosed <br> tags). I would like to parse it and create a DOM document. The problem is the unclosed tags; when I try this:

// text is a placeholder for the quasi-HTML string described above
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource source = new InputSource(new StringReader(text));
Document doc = builder.parse(source);

This error occurs: org.xml.sax.SAXParseException; The element type "br" must be terminated by the matching end-tag "</br>".

I don't want to replace every <br> with <br />. Any solution or advice?

Michael-O · Accepted Answer

Use jsoup and enjoy the ease of use.
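
A minimal sketch of what that could look like, assuming jsoup is on the classpath. The class name QuasiHtmlToDom, the helper toDom and the sample string are made up for illustration; Jsoup.parse and W3CDom.fromJsoup are jsoup calls. jsoup accepts the unclosed <br> tags as they are, and the converted result is a regular org.w3c.dom.Document.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.w3c.dom.Document;

public class QuasiHtmlToDom {

    // Parses lenient HTML with jsoup, then converts the jsoup tree
    // into a standard W3C DOM document.
    static Document toDom(String html) {
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    public static void main(String[] args) {
        // Hypothetical input; the <br> positions are only an example.
        Document doc = toDom("Simple text<br>simple text<br>simple text");
        System.out.println(doc.getElementsByTagName("br").getLength());
    }
}
```

Note that Jsoup.parse wraps the fragment in html, head and body elements, so the text ends up under body in the resulting document.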

Stimpson Cat · Answer

You must rewrite the text as well-formed HTML. Basically, you go through the text and build a List of all opening tags. When you find the corresponding closing tag, you remove that entry from the list. If entries remain in the List when you reach the end, you know the text is not well formed.
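
One way to sketch the bookkeeping described above, using only the JDK. The class name UnclosedTagFinder, the method findUnclosed and the regular expression are my own assumptions, and the pattern only covers simple markup (no comments, CDATA, or attribute values containing '>'). It does not check nesting order, it only reports tags that were opened and never closed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnclosedTagFinder {

    // Group 1: "/" for a closing tag, group 2: tag name,
    // group 3: "/" for a self-closing tag such as <br />.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    // Returns the names of tags that were opened but never closed.
    static List<String> findUnclosed(String text) {
        List<String> open = new ArrayList<>();
        Matcher m = TAG.matcher(text);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            String name = m.group(2).toLowerCase();
            if (selfClosing) {
                continue; // already well formed, nothing to track
            }
            if (closing) {
                // Drop the most recent matching opening tag, if any.
                int idx = open.lastIndexOf(name);
                if (idx >= 0) {
                    open.remove(idx);
                }
            } else {
                open.add(name);
            }
        }
        return open;
    }

    public static void main(String[] args) {
        // Hypothetical input for illustration.
        System.out.println(findUnclosed("Simple text<br>simple <b>text</b>"));
    }
}
```

An empty result means every opening tag found a closing partner; anything left in the list is a candidate for repair.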

The problem is where to insert the missing closing tags. You can try inserting the corresponding closing tag right after the next word. In your case you can simply replace each <br> tag with a <br /> tag, if that's the only kind of unclosed tag that occurs. Assuming string holds the document's content:

string = string.replace("<br>", "<br />");
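
Putting the replacement together with the parser from the question could look roughly like this. The class name BrFixer and the wrapping <root> element are assumptions of mine: an XML document needs exactly one top-level element, so if your text already has one you can drop the wrapper. Parser failures are rethrown as IllegalArgumentException just to keep the sketch short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class BrFixer {

    // Closes every bare <br> and wraps the text in a hypothetical <root>
    // element so the XML parser sees a single top-level element.
    static Document parse(String string) {
        String fixed = "<root>" + string.replace("<br>", "<br />") + "</root>";
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(fixed)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Text is still not well formed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical input; the <br> positions are only an example.
        Document doc = parse("Simple text<br>simple text<br>simple text");
        System.out.println(doc.getElementsByTagName("br").getLength());
    }
}
```

This only handles the exact spelling <br>; variants such as <BR> or <br clear="all"> and other unclosed tags would still make the parser fail.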

Tags: java, parsing

tostao
