I'm using Apache HttpClient in a web crawler that is only for crawling public data.
I'd like it to be able to crawl sites with invalid certificates, no matter how invalid.
My crawler won't be passing in any usernames, passwords, etc., and no sensitive data is being sent or received.
For this use case, I'd crawl the http version of a site if it exists, but sometimes it doesn't, of course.
How can this be done with Apache's HttpClient?
I tried a few suggestions like this one, but they still fail for some invalid certs, for example:
failed for url:https://dh480.badssl.com/, reason:java.lang.RuntimeException: Could not generate DH keypair
failed for url:https://null.badssl.com/, reason:Received fatal alert: handshake_failure
failed for url:https://rc4-md5.badssl.com/, reason:Received fatal alert: handshake_failure
failed for url:https://rc4.badssl.com/, reason:Received fatal alert: handshake_failure
failed for url:https://superfish.badssl.com/, reason:Connection reset
Note that I've tried this with jdk.tls.disabledAlgorithms in my $JAVA_HOME/jre/lib/security/java.security file set to nothing, to ensure this wasn't the issue, and I still get failures like the above.
The short answer to your question, which is to specifically trust all certs, would be to use the TrustAllStrategy and do something like this:
// Build an SSLContext whose trust strategy accepts every certificate chain.
SSLContextBuilder sslContextBuilder = new SSLContextBuilder();
sslContextBuilder.loadTrustMaterial(null, new TrustAllStrategy());
SSLConnectionSocketFactory socketFactory = new SSLConnectionSocketFactory(
        sslContextBuilder.build());
CloseableHttpClient httpclient = HttpClients.custom().setSSLSocketFactory(
        socketFactory).build();
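For reference, here is roughly what "trust all" amounts to at the plain JSSE level. This is only a sketch using the JDK, not what HttpClient does internally; the class name TrustAllSketch and the constants TRUST_ALL and ANY_HOST are names I made up for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class TrustAllSketch {
    // Hypothetical helper: a trust manager that accepts any certificate chain.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { }
        @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { }
        @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
    };

    // Hypothetical helper: a hostname verifier that accepts any host name.
    static final HostnameVerifier ANY_HOST = (hostname, session) -> true;

    // Builds an SSLContext that uses the accept-everything trust manager.
    static SSLContext trustAllContext() throws GeneralSecurityException {
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, new TrustManager[] { TRUST_ALL }, null);
        return context;
    }

    public static void main(String[] args) throws Exception {
        SSLContext context = trustAllContext();
        System.out.println("Protocol: " + context.getProtocol());
    }
}
```

Note that trusting every chain and accepting every host name are two separate checks. The HttpClient snippet above only relaxes the first one, so a certificate issued for a different host may still be rejected; HttpClient 4.4+ ships NoopHostnameVerifier, which can be passed as the second constructor argument of SSLConnectionSocketFactory if you want to relax that too.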
However, an invalid cert may not be your main issue. A handshake_failure can occur for a number of reasons, but in my experience it's usually due to an SSL/TLS version mismatch or a cipher suite negotiation failure. This doesn't mean the SSL cert is "bad"; it's just a mismatch between the server and client. You can see exactly where the handshake is failing using a tool like Wireshark.
While Wireshark can be great to see where it's failing, it won't help you come up with a solution. Whenever I've gone about debugging handshake_failures in the past I've found this tool particularly helpful: https://testssl.sh/
You can point that script at any of your failing websites to learn more about what protocols are available on that target and what your client needs to support in order to establish a successful handshake. It will also print information about the certificate.
For example (showing only two sections of the output of testssl.sh):
./testssl.sh www.google.com
....
Testing protocols (via sockets except TLS 1.2, SPDY+HTTP2)
SSLv2 not offered (OK)
SSLv3 not offered (OK)
TLS 1 offered
TLS 1.1 offered
TLS 1.2 offered (OK)
....
Server Certificate #1
Signature Algorithm SHA256 with RSA
Server key size RSA 2048 bits
Common Name (CN) "www.google.com"
subjectAltName (SAN) "www.google.com"
Issuer "Google Internet Authority G3" ("Google Trust Services" from "US")
Trust (hostname) Ok via SAN and CN (works w/o SNI)
Chain of trust "/etc/*.pem" cannot be found / not readable
Certificate Expiration expires < 60 days (58) (2018-10-30 06:14 --> 2019-01-22 06:14 -0700)
....
Testing all 102 locally available ciphers against the server, ordered by encryption strength
(Your /usr/bin/openssl cannot show DH/ECDH bits)
Hexcode Cipher Suite Name (OpenSSL) KeyExch. Encryption Bits
------------------------------------------------------------------------
xc030 ECDHE-RSA-AES256-GCM-SHA384 ECDH AESGCM 256
xc02c ECDHE-ECDSA-AES256-GCM-SHA384 ECDH AESGCM 256
xc014 ECDHE-RSA-AES256-SHA ECDH AES 256
xc00a ECDHE-ECDSA-AES256-SHA ECDH AES 256
x9d AES256-GCM-SHA384 RSA AESGCM 256
x35 AES256-SHA RSA AES 256
xc02f ECDHE-RSA-AES128-GCM-SHA256 ECDH AESGCM 128
xc02b ECDHE-ECDSA-AES128-GCM-SHA256 ECDH AESGCM 128
xc013 ECDHE-RSA-AES128-SHA ECDH AES 128
xc009 ECDHE-ECDSA-AES128-SHA ECDH AES 128
x9c AES128-GCM-SHA256 RSA AESGCM 128
x2f AES128-SHA RSA AES 128
x0a DES-CBC3-SHA RSA 3DES 168
So, using this output, we can see that if your client only supported SSLv3, the handshake would fail because that protocol isn't offered by the server. The protocol offering is unlikely to be the problem, but you can double-check what your Java client supports by getting the list of enabled protocols. You can provide an overridden implementation of the SSLConnectionSocketFactory from the code snippet above that prints the enabled/supported protocols and cipher suites of each SSLSocket, as follows:
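import java.io.IOException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
class MySSLConnectionSocketFactory extends SSLConnectionSocketFactory {
    // The parent class has no no-arg constructor, so pass the SSLContext through.
    MySSLConnectionSocketFactory(SSLContext sslContext) {
        super(sslContext);
    }
    @Override
    protected void prepareSocket(SSLSocket socket) throws IOException {
        System.out.println("Supported Ciphers: " + Arrays.toString(socket.getSupportedCipherSuites()));
        System.out.println("Supported Protocols: " + Arrays.toString(socket.getSupportedProtocols()));
        System.out.println("Enabled Ciphers: " + Arrays.toString(socket.getEnabledCipherSuites()));
        System.out.println("Enabled Protocols: " + Arrays.toString(socket.getEnabledProtocols()));
    }
}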
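If you just want a quick look at what your JVM offers without wiring anything into HttpClient, the same information is available from the default SSLContext. A minimal sketch using only the JDK; the class name TlsCapabilities and the helper supportedButDisabled are made up for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class TlsCapabilities {
    // Hypothetical helper: entries the JVM supports but does not enable by default.
    static Set<String> supportedButDisabled(String[] supported, String[] enabled) {
        Set<String> result = new LinkedHashSet<>(Arrays.asList(supported));
        result.removeAll(Arrays.asList(enabled));
        return result;
    }

    public static void main(String[] args) throws Exception {
        SSLContext context = SSLContext.getDefault();
        SSLParameters enabled = context.getDefaultSSLParameters();
        SSLParameters supported = context.getSupportedSSLParameters();

        System.out.println("Enabled protocols: " + Arrays.toString(enabled.getProtocols()));
        System.out.println("Enabled ciphers: " + Arrays.toString(enabled.getCipherSuites()));
        System.out.println("Protocols supported but not enabled: "
                + supportedButDisabled(supported.getProtocols(), enabled.getProtocols()));
        System.out.println("Ciphers supported but not enabled: "
                + supportedButDisabled(supported.getCipherSuites(), enabled.getCipherSuites()));
    }
}
```

The "supported but not enabled" lists are the interesting part: anything a failing server needs that shows up there is something the JVM can do but is not offering by default.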
I often encounter handshake_failure when there is a cipher suite negotiation failure. To avoid this error, your client's list of supported cipher suites must contain at least one match to a cipher suite from the server's list of supported cipher suites.
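To check for that overlap, you can intersect your client's enabled suites with what testssl.sh reports for the server. A rough sketch, with two assumptions: the class and method names are made up, and the server list below is hand-translated from a few of the OpenSSL names in the output above into the IANA names that Java uses (testssl.sh prints OpenSSL names, so the two won't match as-is).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import javax.net.ssl.SSLContext;

public class CipherOverlap {
    // Hypothetical helper: suites the client enables that the server also offers.
    static List<String> commonSuites(String[] clientEnabled, Collection<String> serverOffered) {
        List<String> common = new ArrayList<>();
        for (String suite : clientEnabled) {
            if (serverOffered.contains(suite)) {
                common.add(suite);
            }
        }
        return common;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: IANA equivalents of ECDHE-RSA-AES256-GCM-SHA384,
        // ECDHE-RSA-AES128-GCM-SHA256 and AES128-SHA from the testssl.sh output.
        List<String> serverOffered = Arrays.asList(
                "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
                "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
                "TLS_RSA_WITH_AES_128_CBC_SHA");

        String[] clientEnabled =
                SSLContext.getDefault().getDefaultSSLParameters().getCipherSuites();

        List<String> common = commonSuites(clientEnabled, serverOffered);
        System.out.println(common.isEmpty()
                ? "No common cipher suite: expect handshake_failure"
                : "Common cipher suites: " + common);
    }
}
```

An empty result for one of your failing sites would point at cipher suite negotiation rather than at the certificate.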
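If the server requires AES-256-based cipher suites, you probably need the Java Cryptography Extension (JCE) unlimited strength policy files. These are export-restricted, so they may not be available to someone outside the US. (Newer Java releases, 8u161 and later, enable unlimited-strength cryptography by default.)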
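You can check whether your JVM is affected by asking it for the maximum AES key length the installed policy allows. A small sketch using only the JDK; the class name JcePolicyCheck and the method aes256Allowed are made up for this example.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

public class JcePolicyCheck {
    // Hypothetical helper: true when the installed policy permits 256-bit AES keys.
    static boolean aes256Allowed() throws NoSuchAlgorithmException {
        return Cipher.getMaxAllowedKeyLength("AES") >= 256;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Max allowed AES key length: " + Cipher.getMaxAllowedKeyLength("AES"));
        System.out.println("AES-256 cipher suites usable: " + aes256Allowed());
    }
}
```

If this reports that AES-256 is not allowed, the AES256 suites will be missing from your client's list no matter what you enable on the socket.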
More on cryptography restrictions, if you're interested: https://crypto.stackexchange.com/questions/20524/why-there-are-limitations-on-using-encryption-with-keys-beyond-certain-length