
What is the best way to screen-scrape poorly formed XHTML pages for a Java app?

I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work on malformed XHTML, and regular expressions are just a pain.

Is there a better solution? Ideally, I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings, or ask for all the bold text, and so on.

Ankur asked Dec 30 '22 02:12



1 Answer

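Run the XHTML through a cleanup tool such as JTidy, which should give you back well-formed XML. The XPath and XQuery expressions that failed on the raw page can then be run against that cleaned output.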
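As a rough sketch of the querying step, the class below uses only the JDK's built-in XML and XPath APIs to pull out link URLs, link text, and bold text. It assumes the page has already been cleaned into well-formed XML (for example by JTidy); the cleanup call itself is not shown. The class name LinkExtractor and its method names are hypothetical and are not part of JTidy or any other library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: queries XHTML that has ALREADY been cleaned into
// well-formed XML. It does not repair malformed markup itself.
class LinkExtractor {

    // href value of every <a> element that has one, in document order.
    static List<String> linkUrls(String cleanXhtml) {
        return select(cleanXhtml, "//a/@href");
    }

    // Visible text of every <a> element, including text of nested tags.
    static List<String> linkTexts(String cleanXhtml) {
        return select(cleanXhtml, "//a");
    }

    // Text of every <b> or <strong> element.
    static List<String> boldTexts(String cleanXhtml) {
        return select(cleanXhtml, "//b | //strong");
    }

    // Parses the input and returns the trimmed text of each matching node.
    private static List<String> select(String cleanXhtml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Avoid fetching the XHTML DTD over the network when a DOCTYPE is present.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(cleanXhtml)));

            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            // Raw, uncleaned pages typically end up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Calling LinkExtractor.linkUrls(cleaned) returns a List of href strings, which can be turned into an array with toArray(new String[0]) if an array is required. If the cleaned document declares the XHTML namespace and a namespace-aware parser is used, the XPath expressions would need a namespace prefix.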

Jay Kominek answered Jan 14 '23 13:01
