
Java HTML Parsing [closed]

I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML. This works, but I can't help feeling there is a better solution out there.

Is there a nice way to hand a class a line of HTML and get some convenient methods such as:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

Richard Walton asked Oct 26 '08 13:10



2 Answers

Another library that might be useful for HTML processing is jsoup. It tries to clean up malformed HTML and lets you parse HTML in Java using a jQuery-like selector syntax.

http://jsoup.org/

rajsite answered Sep 21 '22 17:09


The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have the XML (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
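To make the SAX option concrete, here is a minimal sketch that uses only the JDK's built-in parser. It assumes the input has already been cleaned into well-formed XHTML, and it matches the class attribute exactly. The names DivTextSaxSketch, DivTextHandler and divTexts are made up for illustration and do not come from any library.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class DivTextSaxSketch {

    /** Collects the text of every div whose class attribute equals the target (hypothetical helper). */
    static class DivTextHandler extends DefaultHandler {
        private final String targetClass;
        private final List<String> texts = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a matching div
        private int nestedDivs;        // divs opened inside the matching div

        DivTextHandler(String targetClass) {
            this.targetClass = targetClass;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if (current != null) {
                // Already collecting: only track nested divs so we know which </div> closes the match
                if ("div".equals(qName)) {
                    nestedDivs++;
                }
            } else if ("div".equals(qName) && targetClass.equals(attrs.getValue("class"))) {
                current = new StringBuilder();
                nestedDivs = 0;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current == null || !"div".equals(qName)) {
                return;
            }
            if (nestedDivs > 0) {
                nestedDivs--;
                return;
            }
            texts.add(current.toString().trim());
            current = null;
        }

        List<String> getTexts() {
            return texts;
        }
    }

    /** Parses well-formed XHTML and returns the text of each div with the given class. */
    public static List<String> divTexts(String xhtml, String cssClass) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        DivTextHandler handler = new DivTextHandler(cssClass);
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getTexts();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
                + "<div class=\"tags\">java <a href=\"/q/1\">html</a></div>"
                + "<div class=\"other\">ignored</div>"
                + "<div class=\"tags\">parsing</div>"
                + "</body></html>";
        System.out.println(divTexts(xhtml, "tags"));
    }
}
```

The handler keeps no tree in memory, which is the main attraction of SAX here. The price is that you track nesting yourself, as the nestedDivs counter does.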

Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    public TestHtmlParse(URL htmlPage) throws IOException
    {
        // Fetch the page and clean it into a tree of TagNode objects
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();

        // The second argument (true) makes the search recursive
        TagNode divElements[] = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            // Exact match on the whole class attribute value
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
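
One caveat about the sample: it compares the whole class attribute with equals, so a div written as class="tags featured" would not match. Below is a minimal sketch of token-based matching that could replace that comparison. The class name CssClassMatchSketch is made up for illustration, and usesClass simply borrows its name from the question; neither is part of HtmlCleaner.

```java
import java.util.Arrays;

public class CssClassMatchSketch {

    /**
     * Returns true if cssClass appears as a whole whitespace-separated token
     * in the value of a class attribute (hypothetical helper).
     */
    public static boolean usesClass(String classAttribute, String cssClass) {
        if (classAttribute == null || cssClass == null || cssClass.isEmpty()) {
            return false;
        }
        // A class attribute may list several names separated by whitespace
        return Arrays.asList(classAttribute.trim().split("\\s+")).contains(cssClass);
    }

    public static void main(String[] args) {
        System.out.println(usesClass("tags", "tags"));
        System.out.println(usesClass("post tags featured", "tags"));
        System.out.println(usesClass("tagsoup", "tags"));
    }
}
```

In getDivsByClass, the condition would then read usesClass(classType, CSSClassname) instead of the null check plus equals.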

Fernando Miguélez answered Sep 25 '22 17:09