Java HTML Parsing [closed]

Tags: java, html, parsing, web-scraping

I'm working on an app that scrapes data from a website, and I was wondering how best to extract it. Specifically, I need the data contained in a number of div tags that use a specific CSS class. Currently (for testing purposes) I'm just checking for

div class = "classname"

in each line of HTML. This works, but I can't help feeling there is a better solution out there.

Is there a nice way to hand a class a line of HTML and get back some convenient methods like:

boolean usesClass(String CSSClassname);
String getText();
String getLink();

asked Oct 26 '08 13:10

Richard Walton

2 Answers

Another library that might be useful for HTML processing is jsoup. jsoup tries to clean malformed HTML and lets you parse HTML in Java using jQuery-like tag selector syntax.

http://jsoup.org/
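As a rough illustration of that selector syntax, here is a minimal sketch that pulls the text and first link out of every div with a given class. It assumes the jsoup jar is on the classpath; the class name `JsoupDivSketch`, its method names, and the sample HTML are made up for this example. Note that the class name is pasted straight into the selector, so it is assumed to be a plain identifier.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical sketch class; requires the jsoup library on the classpath.
class JsoupDivSketch {

    /** Returns the combined text of every div that carries the given CSS class. */
    static List<String> textOfDivsWithClass(String html, String cssClass) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        // "div.name" matches divs whose class attribute contains "name" as one token.
        for (Element div : doc.select("div." + cssClass)) {
            texts.add(div.text());
        }
        return texts;
    }

    /** Returns the href of the first link inside each matching div ("" if there is none). */
    static List<String> firstLinksOfDivsWithClass(String html, String cssClass) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element div : doc.select("div." + cssClass)) {
            Element link = div.selectFirst("a[href]");
            links.add(link == null ? "" : link.attr("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up input; the last div is deliberately left unclosed.
        String html = "<div class=\"item featured\"><a href=\"/a\">First</a> entry</div>"
                    + "<div class=\"other\">skip</div>"
                    + "<div class=\"item\">Second";
        System.out.println(textOfDivsWithClass(html, "item"));
        System.out.println(firstLinksOfDivsWithClass(html, "item"));
    }
}
```

In terms of the methods asked for in the question, `Element.hasClass`, `Element.text` and `Element.attr("href")` roughly correspond to `usesClass`, `getText` and `getLink`.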

answered Sep 21 '22 17:09

rajsite

The main problem, as stated in the preceding comments, is malformed HTML, so an HTML cleaner or an HTML-to-XML converter is a must. Once you have the XML code (XHTML), there are plenty of tools to handle it. You could process it with a simple SAX handler that extracts only the data you need, or with any tree-based method (DOM, JDOM, etc.) that even lets you modify the original code.
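To make the SAX option concrete, here is a minimal sketch using only the JDK's built-in parser. It assumes the input has already been cleaned into well-formed XHTML and carries no DOCTYPE (otherwise the parser may try to fetch the DTD). The class name `DivTextSaxSketch` and its method are hypothetical, and nested matching divs are simply folded into the text of the outer one.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collects the text of every <div> whose class attribute
// contains the given class name. Assumes well-formed XHTML without a DOCTYPE.
class DivTextSaxSketch {

    static List<String> textOfDivsWithClass(String xhtml, final String cssClass) {
        final List<String> results = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // open elements inside the current match (0 = outside)
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (depth > 0) {
                    depth++; // any element nested inside a matching div
                } else if ("div".equals(qName) && hasClass(atts.getValue("class"), cssClass)) {
                    depth = 1;
                    text.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth > 0) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // When the matching div itself closes, store its collected text.
                if (depth > 0 && --depth == 0) {
                    results.add(text.toString().trim());
                }
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
        return results;
    }

    // The class attribute is a whitespace-separated list of class names.
    private static boolean hasClass(String classAttribute, String cssClass) {
        return classAttribute != null
                && Arrays.asList(classAttribute.trim().split("\\s+")).contains(cssClass);
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><div class=\"tags a\">one <b>two</b></div>"
                     + "<div class=\"x\">no</div><div class=\"tags\">three</div></body></html>";
        System.out.println(textOfDivsWithClass(xhtml, "tags"));
    }
}
```

A handler like this never builds a tree, so it suits cases where you only read data; if you need to modify the document, a tree-based API is the better fit.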

Here is sample code that uses HtmlCleaner to get all DIVs that use a certain class and print out all the text content inside them.

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

/**
 * @author Fernando Miguélez Palomo <fernandoDOTmiguelezATgmailDOTcom>
 */
public class TestHtmlParse
{
    static final String className = "tags";
    static final String url = "http://www.stackoverflow.com";

    TagNode rootNode;

    // Downloads the page and lets HtmlCleaner turn it into a clean node tree.
    public TestHtmlParse(URL htmlPage) throws IOException
    {
        HtmlCleaner cleaner = new HtmlCleaner();
        rootNode = cleaner.clean(htmlPage);
    }

    // Returns every div, at any depth, whose class attribute equals CSSClassname.
    List<TagNode> getDivsByClass(String CSSClassname)
    {
        List<TagNode> divList = new ArrayList<TagNode>();

        TagNode[] divElements = rootNode.getElementsByName("div", true);
        for (int i = 0; divElements != null && i < divElements.length; i++)
        {
            // Exact match on the whole attribute value, so a div with
            // several classes (class="tags other") is not returned.
            String classType = divElements[i].getAttributeByName("class");
            if (classType != null && classType.equals(CSSClassname))
            {
                divList.add(divElements[i]);
            }
        }

        return divList;
    }

    public static void main(String[] args)
    {
        try
        {
            TestHtmlParse thp = new TestHtmlParse(new URL(url));

            List<TagNode> divs = thp.getDivsByClass(className);
            System.out.println("*** Text of DIVs with class '" + className + "' at '" + url + "' ***");
            for (TagNode divElement : divs)
            {
                System.out.println("Text child nodes of DIV: " + divElement.getText().toString());
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
    }
}
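One thing to watch in the sample above: it compares the whole class attribute with equals, so a div declared with several classes would be missed. A possible token-based check, in the spirit of the `usesClass` method from the question, could look like the sketch below. The class name `CssClassMatchSketch` and the two-argument signature are assumptions for illustration, not part of HtmlCleaner.

```java
import java.util.Arrays;

// Hypothetical helper, independent of any parsing library.
class CssClassMatchSketch {

    /**
     * Returns true if the whitespace-separated class attribute contains
     * cssClassname as a whole token (case-sensitive).
     */
    static boolean usesClass(String classAttribute, String cssClassname) {
        if (classAttribute == null || cssClassname == null || cssClassname.isEmpty()) {
            return false;
        }
        return Arrays.asList(classAttribute.trim().split("\\s+")).contains(cssClassname);
    }

    public static void main(String[] args) {
        System.out.println(usesClass("tags featured", "tags")); // true
        System.out.println(usesClass("tagsx", "tags"));         // false
        System.out.println(usesClass(null, "tags"));            // false
    }
}
```

The value returned by `getAttributeByName("class")` could be passed in as the first argument in place of the equals comparison.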

answered Sep 25 '22 17:09

Fernando Miguélez
