
HTML/XML Parser for Java [closed]

What HTML parsers have the following features:

  • Fast
  • Thread-safe
  • Reliable and bug-free
  • Parses HTML and XML
  • Handles erroneous HTML
  • Has a DOM implementation
  • Supports HTML4, JavaScript, and CSS tags
  • Relatively simple, object-oriented API

Which parser do you think is best?

Thank you.

Shayan asked Jan 24 '10 23:01




2 Answers

Check out Web Harvest. It is both a library you can use and a data extraction tool, which sounds like exactly what you need. You write XML script files that tell the scraper what information to extract and from where. The provided GUI is very useful for quickly testing the scripts.

Check out the project's samples page to see if it's a good fit for what you are trying to do.

Cesar answered Oct 02 '22 07:10


The best known are NekoHTML and JTidy.

NekoHTML is based on Xerces and provides a simple, adaptable SAX parser that implements the Java SE XMLReader interface.
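Because the parser is exposed through the standard XMLReader interface, the consuming code does not need to know which parser is behind it. Here is a minimal sketch of that idea. The class and method names (LinkCollector, collectLinks, newReader) are made up for illustration, and the JDK's built-in SAX parser is used as a stand-in, so the sample input has to be well-formed. With NekoHTML on the classpath, I believe you would return an org.cyberneko.html.parsers.SAXParser from newReader() instead in order to handle broken HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class LinkCollector {

    /** Collects the href of every <a> element reported by the given XMLReader. */
    static List<String> collectLinks(XMLReader reader, String markup) throws Exception {
        List<String> links = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // Compare case-insensitively: an HTML parser may change element-name case.
                if ("a".equalsIgnoreCase(qName)) {
                    String href = atts.getValue("href");
                    if (href != null) {
                        links.add(href);
                    }
                }
            }
        });
        reader.parse(new InputSource(new StringReader(markup)));
        return links;
    }

    /**
     * Stand-in reader: the JDK's own SAX parser, which only accepts well-formed XML.
     * Assumption: an HTML-tolerant XMLReader (such as NekoHTML's) would be returned here.
     */
    static XMLReader newReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><a href=\"/one\">1</a>"
                + "<p><a href=\"/two\">2</a></p></body></html>";
        System.out.println(collectLinks(newReader(), page)); // prints [/one, /two]
    }
}
```

Only newReader() depends on the parser choice, so swapping parsers later should not affect the handler code.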

JTidy is aimed more at cleaning up your HTML into something XML-valid, but it is still very useful as a parser, and it can produce a DOM tree if needed.
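Once you have a DOM tree, you work with it through the standard org.w3c.dom API regardless of which parser built it. A rough sketch follows, with hypothetical names (DomTitle, parse, title). It uses the JDK's DocumentBuilder as a stand-in, so the input must be well-formed. As far as I know, with JTidy the Document would come from Tidy.parseDOM(InputStream, OutputStream) instead, and the title() method would stay the same.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class DomTitle {

    /**
     * Stand-in parser: the JDK's DocumentBuilder, which requires well-formed markup.
     * Assumption: a tidying HTML parser would supply the Document for real-world pages.
     */
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the trimmed text of the first <title> element, or "" if there is none. */
    static String title(Document doc) {
        NodeList titles = doc.getElementsByTagName("title");
        return titles.getLength() == 0 ? "" : titles.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><head><title>Hello</title></head><body/></html>");
        System.out.println(title(doc)); // prints Hello
    }
}
```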

You could have a look at this list for other alternatives.

Another option is to use Hpricot through JRuby.

Valentin Rocher answered Oct 02 '22 09:10