It's working fine over HTTP, but when I try and use an HTTPS source it throws the following exception:
10-12 13:22:11.169: WARN/System.err(332): javax.net.ssl.SSLHandshakeException: java.security.cert.CertPathValidatorException: Trust anchor for certification path not found. 10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:477) 10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.xnet.provider.jsse.OpenSSLSocketImpl.startHandshake(OpenSSLSocketImpl.java:328) 10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.http.HttpConnection.setupSecureSocket(HttpConnection.java:185) 10-12 13:22:11.179: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.https.HttpsURLConnectionImpl$HttpsEngine.makeSslConnection(HttpsURLConnectionImpl.java:433) 10-12 13:22:11.189: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.https.HttpsURLConnectionImpl$HttpsEngine.makeConnection(HttpsURLConnectionImpl.java:378) 10-12 13:22:11.189: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.http.HttpURLConnectionImpl.connect(HttpURLConnectionImpl.java:205) 10-12 13:22:11.189: WARN/System.err(332): at org.apache.harmony.luni.internal.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:152) 10-12 13:22:11.189: WARN/System.err(332): at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:377) 10-12 13:22:11.189: WARN/System.err(332): at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364) 10-12 13:22:11.189: WARN/System.err(332): at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
Here's the relevant code:
try { doc = Jsoup.connect("https url here").get(); } catch (IOException e) { Log.e("sys","coudnt get the html"); e.printStackTrace(); }
What It Is. jsoup can parse HTML files, input streams, URLs, or even strings. It eases data extraction from HTML by offering Document Object Model (DOM) traversal methods and CSS and jQuery-like selectors. jsoup can manipulate the content: the HTML element itself, its attributes, or its text.
Jsoup parses the source code as delivered from the server (or in this case loaded from file). It does not invoke client-side actions such as JavaScript or CSS DOM manipulation.
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. jsoup implements the WHATWG HTML specification, and parses HTML to the same DOM as modern browsers do.
If you want to do it the right way, and/or you need to deal with only one site, then you basically need to grab the SSL certificate of the website in question and import it in your Java key store. This will result in a JKS file which you in turn set as SSL trust store before using Jsoup (or java.net.URLConnection
).
You can grab the certificate from your webbrowser's store. Let's assume that you're using Firefox.
Now you've a web2.uconn.edu.crt
file.
Next, open the command prompt and import it in the Java key store using the keytool
command (it's part of the JRE):
keytool -import -v -file /path/to/web2.uconn.edu.crt -keystore /path/to/web2.uconn.edu.jks -storepass drowssap
The -file
must point to the location of the .crt
file which you just downloaded. The -keystore
must point to the location of the generated .jks
file (which you in turn want to set as SSL trust store). The -storepass
is required, you can just enter whatever password you want as long as it's at least 6 characters.
Now, you've a web2.uconn.edu.jks
file. You can finally set it as SSL trust store before connecting as follows:
System.setProperty("javax.net.ssl.trustStore", "/path/to/web2.uconn.edu.jks"); Document document = Jsoup.connect("https://web2.uconn.edu/driver/old/timepoints.php?stopid=10").get(); // ...
As a completely different alternative, particularly when you need to deal with multiple sites (i.e. you're creating a world wide web crawler), then you can also instruct Jsoup (basically, java.net.URLConnection
) to blindly trust all SSL certificates. See also section "Dealing with untrusted or misconfigured HTTPS sites" at the very bottom of this answer: Using java.net.URLConnection to fire and handle HTTP requests
In my case, all I needed to do was to add the .validateTLSCertificates(false) in my connection
Document doc = Jsoup.connect(httpsURLAsString) .timeout(60000).validateTLSCertificates(false).get();
I also had to increase the read timeout but I think this is irrelevant
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With