What is the best way to screen scrape poorly formed XHTML pages for a Java app?

Question

I want to be able to grab content from web pages, especially the tags and the content within them. I have tried XQuery and XPath, but they don't seem to work on malformed XHTML, and regex is just a pain.

Is there a better solution? Ideally I would like to be able to ask for all the links and get back an array of URLs, or ask for the text of the links and get back an array of Strings with the link text, or ask for all the bold text, and so on.

Jay Kominek · Accepted Answer

Run the XHTML through something like JTidy, which should give you back valid XML.
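
Once the markup is well-formed, the XPath approach from the question should work again. Here is a minimal sketch of that second step using only the JDK's built-in XML and XPath classes. The tidying step itself is not shown: the sample string stands in for whatever the cleanup tool writes out, and `LinkExtractor`, `linkUrls`, `linkTexts` and `boldTexts` are hypothetical names made up for this illustration, not part of JTidy or any other library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: runs XPath queries against XHTML that is already well-formed.
final class LinkExtractor {

    // Parses well-formed XHTML text into a DOM tree.
    static Document parse(String wellFormedXhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Not namespace-aware, so plain XPath names such as "a" match the elements.
        factory.setNamespaceAware(false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(wellFormedXhtml)));
    }

    // Evaluates an XPath expression and returns the trimmed text of every matched node.
    static String[] select(Document doc, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent().trim());
        }
        return values.toArray(new String[0]);
    }

    // All link targets, in document order.
    static String[] linkUrls(Document doc) throws Exception {
        return select(doc, "//a/@href");
    }

    // The visible text of every link, including text inside nested tags.
    static String[] linkTexts(Document doc) throws Exception {
        return select(doc, "//a");
    }

    // All bold text, whether marked up with <b> or <strong>.
    static String[] boldTexts(Document doc) throws Exception {
        return select(doc, "//b | //strong");
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input, standing in for the output of the tidying step.
        String xhtml = "<html><body><p>"
                + "<a href=\"http://example.com/one\">First <b>link</b></a> and "
                + "<a href=\"/two\">second</a>"
                + "</p></body></html>";
        Document doc = parse(xhtml);
        System.out.println(Arrays.toString(linkUrls(doc)));
        System.out.println(Arrays.toString(linkTexts(doc)));
        System.out.println(Arrays.toString(boldTexts(doc)));
    }
}
```

Two caveats about this sketch: a real page with an XHTML DOCTYPE may cause the parser to try to download the DTD, and a namespace-aware parser would need namespace-qualified XPath names, so treat the settings above as a starting point rather than a finished solution.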

Tags: java, regex, xpath, screen-scraping, xquery

Ankur
