How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

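import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a"); // every <a> element (this sample has none, so the result is empty)
Element head = doc.select("head").first(); // first match for the selector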

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

The best I've seen so far is HtmlCleaner:

HtmlCleaner is an open-source HTML parser written in Java. HTML found on the Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring order to tags, attributes and ordinary text. For a given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows rules similar to those that most web browsers use to create the Document Object Model. However, the user may provide a custom tag and rule set for tag filtering and balancing.

With HtmlCleaner you can locate any element using XPath.
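
To give a feel for what XPath lookups look like once the markup is well-formed, here is a minimal sketch. It does not use HtmlCleaner's own API; it assumes the input is already well-formed XML (the kind of output HtmlCleaner is described as producing) and queries it with the JDK's built-in javax.xml.xpath. The class name XPathSketch and the helper selectText are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical helper (not part of HtmlCleaner): parses well-formed markup
    // with the JDK DOM parser and returns the text of every node the XPath matches.
    public static List<String> selectText(String wellFormedXml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(wellFormedXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // For elements this is the contained text; for attributes it is the value.
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: this string stands in for markup that has already been cleaned up.
        String xml = "<html><head><title>First parse</title></head>"
            + "<body><p>Parsed HTML into a doc.</p>"
            + "<p><a href=\"/next\">Next</a></p></body></html>";

        System.out.println(selectText(xml, "//title"));   // prints [First parse]
        System.out.println(selectText(xml, "//a/@href")); // prints [/next]
    }
}
```

Note that the JDK parser rejects real-world tag soup outright, which is exactly why a cleaning step has to come first.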

For other HTML parsers, see this SO question.

I suggest Validator.nu's parser, which is based on the HTML5 parsing algorithm. It has been the parser used in Mozilla since 2010-05-03.

Tags:

java

html

parsing

html-parsing

web-scraping
