
jsoup : How to search for date text from a webpage

This is what I am trying to do with jsoup:

    1. Pass a single URL to parse
    2. Search for the date(s) mentioned in the contents of the web page
    3. Extract at least one date from each page's contents
    4. Convert that date into a standard format

For point #1, this is what I have now:

String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();

Here I want to understand what kind of object "Document" is: has it already been parsed from the HTML (or whatever type the web page is), or is it something else?

For point #2, this is what I have now:

Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);

Here I am trying to match a date regex against the page and store the result in a string for later use (point #3), but I am sure I am nowhere near it and need help.

I have already done point #4.

Could anyone help me understand this and point me in the right direction to achieve the four points above?

Thanks in advance!

Update: this is what I want:

public static void main(String[] args){
    try {
        // Send a User-Agent header so the server treats the request as coming from a browser, not a bot
        final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";

        // The single URL I want to parse
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";

        // Create a jsoup Connection to the URL with the User-Agent set
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);

        // Retrieve the parsed document
        Document htmlDocument = connection.get();

        /* Up to this point I have a parsed document of the page. Is it in plain-text format?
         * If not, in which type or format is it stored in the variable 'htmlDocument'?
         */

        /* If 'htmlDocument' holds the text of the web page,
         * why do I need elements to find dates? Dates can be ordinary text in a web page,
         * so how would I find an element tag for them?
         * For example, if I wanted to collect text from <p> paragraph tags,
         * I would use this:
         */
        // I am not sure whether this is correct
        //***************************************************//
        Elements paragraph = htmlDocument.getElementsByTag("p");
        for(Element src: paragraph){
            // text() returns the text inside each <p> element
            System.out.println("text: " + src.text());
        }
        //***************************************************//

        /* But I do not want to look up any elements to gather the dates on the page.
         * I just want to search the whole text of the document for dates.
         * So I need a date regex that is passed as input to a search method,
         * and the search should run over the text of the page held in 'htmlDocument'.
         */

        // In the end I will use only one date from the search results and convert it to a standard format

        /*
         * That is it.
         */


        /*
         * I was trying something like this
         */
        //final Elements elements = document.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);

        for(Element e: elements){
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Fahim Uddin Avatar asked Oct 29 '22 12:10



1 Answer

Here is one possible solution I found:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {

    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // This pattern finds dates in dd/mm/yyyy form; to cover other formats, add one more pattern per format
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

            // Find every element whose text, including the text of its descendants, matches the pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Keep only the leaf elements, i.e. those with no descendant that also matches
            List<Element> finalElements = elements.stream().filter(elem -> isLastElem(elem, pattern)).collect(Collectors.toList());
            finalElements.forEach(elem ->
                System.out.println("Node: " + elem.html())
            );

        }catch(Exception ex){
            ex.printStackTrace();
        }
    }

    // Decides whether the element is a leaf match. The search includes the element itself,
    // so a result of size 1 means that no descendant also contains a date.
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }

}

The point is to add as many patterns as you need, because I think it would be complex to find a single pattern that matches all possibilities.
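As a rough sketch of the "one pattern per format" idea, the code below pairs each regex with a formatter that can parse what the regex matches, and converts every hit to a `LocalDate` (whose `toString()` is the ISO yyyy-MM-dd form). It works on a plain string, so it needs only the JDK; the class name `MultiPatternDates`, the helper `DateRule`, and the two formats chosen are illustrative assumptions, not part of the solution above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternDates {

    // Hypothetical helper: pairs a regex with the formatter that parses its matches.
    private static final class DateRule {
        final Pattern pattern;
        final DateTimeFormatter formatter;

        DateRule(String regex, String format) {
            this.pattern = Pattern.compile(regex);
            // STRICT rejects impossible dates such as 31/02/2017
            this.formatter = DateTimeFormatter.ofPattern(format)
                    .withResolverStyle(ResolverStyle.STRICT);
        }
    }

    // Illustrative formats only; add one entry per format expected on the page.
    private static final List<DateRule> RULES = Arrays.asList(
            new DateRule("\\b\\d{4}-\\d{2}-\\d{2}\\b", "uuuu-MM-dd"),
            new DateRule("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b", "d/M/uuuu"));

    // Returns every valid date found, grouped by rule and then by position in the text.
    static List<LocalDate> findDates(String text) {
        List<LocalDate> dates = new ArrayList<>();
        for (DateRule rule : RULES) {
            Matcher matcher = rule.pattern.matcher(text);
            while (matcher.find()) {
                try {
                    dates.add(LocalDate.parse(matcher.group(), rule.formatter));
                } catch (DateTimeParseException ignored) {
                    // Looked like a date but is not a valid one; skip it
                }
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "Posted 2016-12-21, edited 20/11/2017, bogus 31/02/2017.";
        // Each LocalDate prints in ISO-8601 form (yyyy-MM-dd)
        findDates(text).forEach(System.out::println);
    }
}
```

Keeping the regex loose and letting the formatter validate the match avoids encoding calendar rules in the regex itself, at the cost of a try/catch per candidate.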

Edit: The most important thing is that the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaf. For instance:

<html>
    <body>
        <div>
           20/11/2017    
        </div>
    </body>
</html>

If we search for the pattern dd/mm/yyyy, the library returns the html, body and div elements, since the text of each one contains the date, but we are only interested in div.
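If only the date string is needed and not the element that holds it, one possible alternative is to skip the element hierarchy and run the regex over the page's plain text. The sketch below assumes that text has already been obtained as a `String`, for example with jsoup's `htmlDocument.text()`; the class name `FirstDateInText` and the method `firstDate` are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstDateInText {

    // Same dd/mm/yyyy expression as in the solution above
    private static final Pattern DATE = Pattern.compile(
            "(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");

    // Returns the first substring of pageText that matches the date pattern, if any.
    static Optional<String> firstDate(String pageText) {
        Matcher matcher = DATE.matcher(pageText);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the string that htmlDocument.text() would return
        String pageText = "Edited on 20/11/2017 and again on 01/12/2017";
        System.out.println(firstDate(pageText).orElse("no date found"));
    }
}
```

This trades away knowledge of where the date sits in the page for a simpler search, which may be enough when a single date per page is all that is required.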

cralfaro Avatar answered Nov 15 '22 06:11
