HTML page to XHTML with TagSoup

Sorry if this is too simple, but I couldn't find either a tutorial or the documentation for the Java version of TagSoup.

Basically I want to download an HTML webpage from the internet and turn it into XHTML, contained in a string. How can I do this with TagSoup?

Thanks!

konr asked Jan 23 '23 02:01


1 Answer

From the command line, something like this:

wget -O - example.com/bad.html | java -jar tagsoup.jar

Or, from Java:

To parse HTML:

  • Create an instance of org.ccil.cowan.tagsoup.Parser
  • Provide your own SAX2 ContentHandler
  • Provide an InputSource referring to the HTML
  • And parse()!
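
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes the TagSoup jar is on the classpath, and it uses the XMLWriter class that ships with TagSoup as the ContentHandler so that the SAX events are serialized into a string. The class name TagSoupExample and the helper methods toXhtml and fetchAsXhtml are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.net.URI;

import org.ccil.cowan.tagsoup.Parser;
import org.ccil.cowan.tagsoup.XMLWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; the names below are illustrative only.
class TagSoupExample {

    // Parses whatever the InputSource refers to and returns the serialized markup.
    static String toXhtml(InputSource source) throws IOException, SAXException {
        StringWriter out = new StringWriter();
        // XMLWriter is a SAX2 ContentHandler that writes the events it receives as XML.
        XMLWriter writer = new XMLWriter(out);

        Parser parser = new Parser();      // org.ccil.cowan.tagsoup.Parser, a SAX2 XMLReader
        parser.setContentHandler(writer);  // provide the ContentHandler
        parser.parse(source);              // and parse()
        return out.toString();
    }

    // Convenience overload for HTML that is already held in a string.
    static String toXhtml(String html) throws IOException, SAXException {
        return toXhtml(new InputSource(new StringReader(html)));
    }

    // Downloads the page at the given address and converts it.
    // Character encoding handling is left to the parser in this sketch.
    static String fetchAsXhtml(String address) throws IOException, SAXException {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return toXhtml(new InputSource(in));
        }
    }

    public static void main(String[] args) throws Exception {
        // Unclosed tags in the input are closed in the output.
        System.out.println(toXhtml("<p>Hello <b>world"));
        // Optionally convert a page from the network, e.g. http://example.com/bad.html
        if (args.length > 0) {
            System.out.println(fetchAsXhtml(args[0]));
        }
    }
}
```

If you would rather write your own ContentHandler, as the list above suggests, pass it to setContentHandler() in place of the XMLWriter.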
Pascal Thivent answered Jan 29 '23 07:01