
Is there a library similar to lxml or nokogiri for Java? [closed]

I want to do some screen scraping, ideally using CSS selectors and not XPath. Is there a library similar to the ones available for Ruby or Python?

VoY asked Jan 23 '10 10:01


2 Answers

There are dozens of screen scraping libraries written in Java. Just to cite a few:

  • TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
  • Jericho HTML Parser - Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions. It is neither an event-based nor a tree-based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the relevant characters of each search operation.
  • HTML Cleaner - HtmlCleaner reorders individual elements and produces well-formed XML from dirty HTML. It follows rules similar to those most web browsers use to create the document object model. A user may provide a custom tag and rule set for tag filtering and balancing.
  • NekoHTML - NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
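Several of the libraries above produce well-formed XML or expose standard XML interfaces, so the JDK's own XML tooling can take over from there. Here is a minimal sketch of that second step, using only the standard library. It assumes the HTML has already been tidied into well-formed, namespace-free markup by one of these cleaners (the input string below is a hand-written stand-in, not real cleaner output), and the class and method names are made up for illustration. It queries with XPath because that is what ships with the JDK; it is not a CSS-selector solution.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: the name and shape are illustrative only.
class CleanedHtmlLinks {

    // Returns the href value of every <a> element, in document order.
    // Assumption: the input is already well-formed and uses no XML namespace.
    static List<String> extractLinks(String wellFormedMarkup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc =
                    builder.parse(new InputSource(new StringReader(wellFormedMarkup)));

            // Select the href attribute nodes of all anchors in the document.
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs =
                    (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Raw, untidied HTML will typically end up here.
            throw new IllegalArgumentException("Could not parse or query the document", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for markup that a cleaner has already made well-formed.
        String cleaned =
                "<html><body><a href=\"/a\">A</a><p><a href=\"/b\">B</a></p></body></html>";
        System.out.println(extractLinks(cleaned));
    }
}
```

Feeding raw, unbalanced HTML straight into `extractLinks` would most likely throw, which is the gap the tag-balancing parsers listed above are meant to close.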

Many more are listed at HTML Screen Scraping Tools written in Java, but these are IMO the best at dealing with any kind of content (that is, all kinds of crap), as I mentioned in this previous answer. This might not be an issue for you though.

Just in case, maybe check out the thread Nokogiri pure Java status.

Update: A new project, jsoup, was released on 2010-01-31. It offers a CSS-like selector syntax to find elements. See its website for more details and/or this answer from its author.

Pascal Thivent answered Oct 27 '22 09:10


You could use Hpricot through JRuby. See this SO question for more details about it.

Valentin Rocher answered Oct 27 '22 08:10