Simply put, this is what I am trying to do (I want to use jsoup).
Point #1. What I have now:
String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
Document document = Jsoup.connect(url).get();
Here I want to understand what kind of object "Document" is: is it the already-parsed HTML, the page in some other format, or something else?
Point #2. What I have now:
Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Elements elements = document.getElementsMatchingOwnText(p);
Here I am trying to use a date regex to find the dates in the page and store them in a string for later use (Point #3), but I am sure I am nowhere near it and need help here.
I have done Point #4.
Can anyone help me understand this and point me in the right direction, so that I can achieve the 4 points mentioned above?
Thanks in advance!
Update: here is what I want:
public static void main(String[] args) {
    try {
        // Send a User-Agent header so the server sees the request as coming from a browser, not a bot
        final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        // The only URL I want to parse
        String url = "http://stackoverflow.com/questions/28149254/using-a-regex-in-jsoup";
        // Create a jsoup Connection to the URL with the User-Agent set
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        // Retrieve the parsed document
        Document htmlDocument = connection.get();
        /* Up to this point, I have a parsed document of the page, which is in plain-text format, right?
         * If not, in which type or format is it stored in the variable 'htmlDocument'?
         */
        /* Now, if 'htmlDocument' holds the text of the web page,
         * why do I need elements to find dates? Dates can be normal text in a web page,
         * so how am I going to find an element tag for them?
         * As an example, if I wanted to collect text from <p> paragraph tags,
         * I would use this:
         */
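        // I am not sure whether this is correct or not
        //***************************************************/
        Elements paragraph = htmlDocument.getElementsByTag("p");
        for (Element src : paragraph) {
            // text() returns the text inside the element; attr("abs:p") would return an
            // empty string because <p> elements have no "p" attribute
            System.out.println("text: " + src.text());
        }
        //***************************************************//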
        /* But I do not want to look up any particular elements to gather the dates on the page.
         * I just want to search the whole text of the document for dates.
         * So I need a date regex that is passed as input to a search method,
         * and this search should run on the text of the page, since we have the parsed document in 'htmlDocument'.
         */
        // At the end we will use only one date from the search results and format it in a standard form
        /*
         * That is it.
         */
        /*
         * I was trying something like this
         */
        //final Elements elements = document.getElementsMatchingOwnText("\\d{4}-\\d{2}-\\d{2}");
        Pattern p = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Elements elements = htmlDocument.getElementsMatchingOwnText(p);
        for (Element e : elements) {
            System.out.println("element = [" + e + "]");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
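On the questions raised in the comments above: a jsoup `Document` is a parsed element tree rather than plain text, and calling `text()` on it returns the combined text of the page as one `String`. Once that string is available, the whole-text search can be done with `java.util.regex` alone. What follows is only a minimal sketch, assuming the page text has already been extracted (for example with `htmlDocument.text()`); the class `DateSearch`, the method `findDates`, and the sample string are hypothetical names used for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateSearch {
    // Same yyyy-MM-dd shape as the pattern in the question. It checks the shape only,
    // so an impossible date such as 2015-19-39 would also match.
    private static final Pattern ISO_DATE = Pattern.compile("\\d{4}-[01]\\d-[0-3]\\d");

    // Returns every substring of 'text' that looks like a yyyy-MM-dd date, in order of appearance.
    static List<String> findDates(String text) {
        List<String> dates = new ArrayList<>();
        Matcher matcher = ISO_DATE.matcher(text);
        while (matcher.find()) {
            dates.add(matcher.group());
        }
        return dates;
    }

    public static void main(String[] args) {
        // Stand-in for htmlDocument.text(); hard-coded so the sketch runs without a network call.
        String pageText = "asked 2015-01-26 11:02 and edited 2015-01-27 by someone";
        System.out.println(findDates(pageText));
    }
}
```

Because the search runs on a plain string, no element lookup is needed; the first entry of the returned list could then be picked and reformatted (Point #3).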
Here is one possible solution I found:
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
/**
 * Created by ruben.alfarodiaz on 21/12/2016.
 */
@RunWith(JUnit4.class)
public class StackTest {
    @Test
    public void findDates() {
        final String USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1";
        try {
            String url = "http://stackoverflow.com/questions/51224/regular-expression-to-match-valid-dates";
            Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
            Document htmlDocument = connection.get();
            // This pattern finds all dates in dd/mm/yyyy format; to cover extra formats, create more patterns
            Pattern pattern = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");
            // Find all document elements whose text (including that of their descendants) matches the pattern
            Elements elements = htmlDocument.getElementsMatchingText(pattern);
            // Filter the matched elements to keep only the leaf elements
            List<Element> finalElements = elements.stream()
                    .filter(elem -> isLastElem(elem, pattern))
                    .collect(Collectors.toList());
            finalElements.forEach(elem ->
                    System.out.println("Node: " + elem.html())
            );
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
    // Decides whether the current element is a leaf or contains other matching elements inside
    private boolean isLastElem(Element elem, Pattern pattern) {
        return elem.getElementsMatchingText(pattern).size() <= 1;
    }
}
More patterns should be added as needed, because I think it would be complex to find a single pattern that matches all possibilities.
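One way to organize several patterns is to pair each regex with the `DateTimeFormatter` that parses its matches, so every hit can be normalized to one standard form. This is a minimal sketch under stated assumptions: the class `MultiFormatDates`, the nested `DateFormat` holder, and the two formats chosen (dd/MM/yyyy and yyyy-MM-dd) are illustrative and not part of the code above, and the input is assumed to be plain text already extracted from the page.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiFormatDates {
    // Pairs a regex that locates candidates with the formatter that parses them.
    static final class DateFormat {
        final Pattern pattern;
        final DateTimeFormatter formatter;

        DateFormat(String regex, String format) {
            this.pattern = Pattern.compile(regex);
            this.formatter = DateTimeFormatter.ofPattern(format);
        }
    }

    // Add one entry per date format that should be recognized (hypothetical selection).
    static final List<DateFormat> FORMATS = Arrays.asList(
            new DateFormat("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b", "d/M/yyyy"),
            new DateFormat("\\b\\d{4}-\\d{2}-\\d{2}\\b", "yyyy-MM-dd"));

    // Returns all recognized dates, grouped by format in the order of FORMATS.
    static List<LocalDate> findDates(String text) {
        List<LocalDate> dates = new ArrayList<>();
        for (DateFormat format : FORMATS) {
            Matcher matcher = format.pattern.matcher(text);
            while (matcher.find()) {
                try {
                    dates.add(LocalDate.parse(matcher.group(), format.formatter));
                } catch (DateTimeParseException ignored) {
                    // Right shape but out-of-range fields (e.g. month 13): skip it.
                }
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "Posted 20/11/2017, updated 2017-12-01, bogus 45/13/2017.";
        // LocalDate.toString() prints the ISO-8601 form yyyy-MM-dd.
        findDates(text).forEach(System.out::println);
    }
}
```

Keeping the regexes loose and letting `LocalDate.parse` reject out-of-range values keeps each pattern short; supporting another format then means adding one more entry to `FORMATS`.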
Edit: The most important point is that the library gives you a hierarchy of elements, so you need to iterate over them to find the final leaf. For instance:
<html>
  <body>
    <div>
      20/11/2017
    </div>
  </body>
</html>
If we search for the pattern dd/mm/yyyy, the library will return 3 elements (html, body and div), but we are only interested in div.
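To see why the ancestors match as well, here is a small stand-in model of that hierarchy. It deliberately does not use jsoup: `LeafMatchDemo`, `Node`, `allMatches`, `leafMatches` and `sampleTree` are hypothetical names for this sketch, and it assumes, in line with how `getElementsMatchingText` behaves, that an element's text includes the text of all its descendants. The leaf test follows the same idea as `isLastElem`: keep an element only if nothing inside it matches as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LeafMatchDemo {
    // Minimal stand-in for an HTML element: a tag name, its own text, and child nodes.
    static final class Node {
        final String tag;
        final String ownText;
        final List<Node> children;

        Node(String tag, String ownText, Node... children) {
            this.tag = tag;
            this.ownText = ownText;
            this.children = Arrays.asList(children);
        }

        // Combined text of this node and all of its descendants.
        String text() {
            StringBuilder sb = new StringBuilder(ownText);
            for (Node child : children) {
                sb.append(' ').append(child.text());
            }
            return sb.toString();
        }
    }

    // Tags of every node whose combined text contains a match (ancestors included).
    static List<String> allMatches(Node node, Pattern pattern) {
        List<String> tags = new ArrayList<>();
        if (pattern.matcher(node.text()).find()) {
            tags.add(node.tag);
        }
        for (Node child : node.children) {
            tags.addAll(allMatches(child, pattern));
        }
        return tags;
    }

    // Tags of only the deepest matching nodes: those with no matching descendant.
    static List<String> leafMatches(Node node, Pattern pattern) {
        List<String> tags = new ArrayList<>();
        if (!pattern.matcher(node.text()).find()) {
            return tags; // nothing below this node can match either
        }
        for (Node child : node.children) {
            tags.addAll(leafMatches(child, pattern));
        }
        if (tags.isEmpty()) {
            tags.add(node.tag); // no child matched, so this node is the leaf
        }
        return tags;
    }

    // Mirrors the html > body > div example above.
    static Node sampleTree() {
        Node div = new Node("div", "20/11/2017");
        Node body = new Node("body", "", div);
        return new Node("html", "", body);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("(0?[1-9]|[12][0-9]|3[01])/(0?[1-9]|1[012])/((19|20)\\d\\d)");
        System.out.println(allMatches(sampleTree(), p));  // [html, body, div]
        System.out.println(leafMatches(sampleTree(), p)); // [div]
    }
}
```

In jsoup itself, `getElementsMatchingOwnText` (the call used in the question) restricts the match to an element's own text, which may make the extra filtering step unnecessary for simple pages.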