I'm working on an app that scrapes data from a website, and I was wondering how I should go about getting the data. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for
div class = "classname"
in each line of HTML. This works, but I can't help feeling there is a better solution out there.
Is there a nice way to hand a line of HTML to a class and get some convenient methods, such as:
    boolean usesClass(String CSSClassname);
    String getText();
    String getLink();
jsoup's party trick is a CSS selector syntax to find elements, e.g.:
    String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    Elements links = doc.
In this article, I will focus on one of my favorites, jsoup, which was first released as open source in January 2010. It has been under active development since then by Jonathan Hedley, and the code uses the liberal MIT license.
Just call the html2text method, passing it the HTML text, and it will return plain text.
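The html2text method itself is not shown in this thread. As a hypothetical stand-in, here is a minimal sketch that strips tags using the HTML parser bundled with the JDK (javax.swing.text.html). The class name Html2TextSketch and the choice to join text runs with a single space are assumptions made for illustration, not the original author's code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: not the html2text referred to above, just one way to write it.
public class Html2TextSketch {

    /** Returns the text content of the given HTML with all tags removed. */
    public static String html2text(String html) {
        StringBuilder text = new StringBuilder();

        // The parser calls handleText once for each run of text between tags.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Assumption: separate runs with one space so words do not fuse together.
                text.append(data).append(' ');
            }
        };

        try (Reader in = new StringReader(html)) {
            // true = ignore any charset declared inside the document.
            new ParserDelegator().parse(in, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }
}
```

Calling Html2TextSketch.html2text(someHtml) returns a plain String. The Swing parser is lenient but dated, so for heavily malformed pages a dedicated library is probably the safer choice.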
Another library that might be useful for HTML processing is jsoup. jsoup tries to clean malformed HTML and allows HTML parsing in Java using a jQuery-like tag selector syntax.
http://jsoup.org/
The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or HTML-to-XML converter is a must. Once you have the XML code (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
Here is some sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.
    import java.io.IOException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;

    /**
     * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
     */
    public class TestHtmlParse {
        static final String className = "tags";
        static final String url = "http://www.stackoverflow.com";

        TagNode rootNode;

        public TestHtmlParse(URL htmlPage) throws IOException {
            // HtmlCleaner repairs malformed HTML and returns the root of a clean tag tree.
            HtmlCleaner cleaner = new HtmlCleaner();
            rootNode = cleaner.clean(htmlPage);
        }

        List<TagNode> getDivsByClass(String CSSClassname) {
            List<TagNode> divList = new ArrayList<>();
            // true = search the whole tree recursively, not just direct children.
            TagNode[] divElements = rootNode.getElementsByName("div", true);
            for (int i = 0; divElements != null && i < divElements.length; i++) {
                String classType = divElements[i].getAttributeByName("class");
                // The class attribute may hold several space-separated names,
                // so compare against each token rather than the whole string.
                if (classType != null
                        && Arrays.asList(classType.trim().split("\\s+")).contains(CSSClassname)) {
                    divList.add(divElements[i]);
                }
            }
            return divList;
        }

        public static void main(String[] args) {
            try {
                TestHtmlParse thp = new TestHtmlParse(new URL(url));
                List<TagNode> divs = thp.getDivsByClass(className);
                System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
                for (TagNode divElement : divs) {
                    System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
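The SAX alternative mentioned above can be sketched with the JDK alone. This is a minimal illustration, assuming the page has already been converted to well-formed XHTML with no DOCTYPE and no undeclared entities such as &nbsp;. The class name DivTextSaxSketch and the method divTextByClass are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper illustrating the SAX approach; assumes well-formed XHTML input.
public class DivTextSaxSketch {

    /** Returns the text of every div whose class attribute contains className. */
    public static List<String> divTextByClass(String xhtml, String className) {
        List<String> results = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;       // 0 = outside a matching div, otherwise div nesting level
            private StringBuilder text;  // text collected for the current matching div

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (!"div".equalsIgnoreCase(qName)) {
                    return;
                }
                if (depth > 0) {         // a div nested inside a match: just track the nesting
                    depth++;
                    return;
                }
                String classAttr = attrs.getValue("class");
                if (classAttr != null
                        && Arrays.asList(classAttr.trim().split("\\s+")).contains(className)) {
                    depth = 1;
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (depth > 0 && "div".equalsIgnoreCase(qName) && --depth == 0) {
                    results.add(text.toString().trim());
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        return results;
    }
}
```

Because the handler only buffers text while it is inside a matching div, memory use stays small even for large pages. A matching div nested inside another matching div is folded into the outer one's text, which is a simplification chosen to keep the sketch short.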