I am trying to crawl a website, more specifically a Google Site, using ManifoldCF. The site has SAML authentication, and I want to index the crawled data into Apache Solr. But when I crawl the URL, it gives me a 302 redirect to the login page and then reports RESPONSECODENOTINDEXABLE.
I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, which looks more like form-based authentication than SAML authentication.
Has anybody used ManifoldCF to crawl a website that has SAML authentication? And if not ManifoldCF, has anyone been able to accomplish this with Apache Nutch? I am afraid it also provides only HTTP basic, Digest, and NTLM authentication.
Any insight would be helpful. I can provide more information about the issue if anyone here thinks it can be accomplished. Basically, when I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler refuses to crawl any further, stopping at the 302 response. It's an intranet-based website.
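For reference, the symptom described here (a 302 that points at an SSO login page rather than at content) can be told apart from an ordinary on-site redirect with a small check. The following is only a rough sketch: the function name `is_login_redirect`, the hint words, and the example hosts are all assumptions made for illustration and are not part of ManifoldCF or Nutch.

```python
from urllib.parse import urlparse

# Status codes that carry a Location header pointing somewhere else.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_login_redirect(status, location, site_host,
                      login_hints=("login", "saml", "sso", "signin")):
    """Guess whether a response is a redirect to an off-site login page.

    status      -- HTTP status code of the response
    location    -- value of the Location header, or None
    site_host   -- host name of the site being crawled
    login_hints -- assumed substrings that suggest a login/SSO endpoint
    """
    if status not in REDIRECT_CODES or not location:
        return False
    target = urlparse(location)
    # A relative Location has no host, so it stays on the crawled site.
    off_site = bool(target.netloc) and target.netloc.lower() != site_host.lower()
    text = (target.netloc + target.path).lower()
    looks_like_login = any(hint in text for hint in login_hints)
    return off_site and looks_like_login
```

Feeding it the status code and Location header that the crawler logs for the start URL would indicate whether the fetch is being bounced to the identity provider, as opposed to failing for some other reason.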
There is no support in Nutch for SSO authentication using SAML. You need to handle it by writing a custom plugin. We have extended the protocol-selenium plugin to handle SAML flows.
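To give an idea of why a browser-driven fetcher is a natural fit, here is a minimal sketch of one step of a typical SAML HTTP-POST flow: after login, the identity provider returns an HTML page containing a form with a hidden SAMLResponse field, which the browser posts back to the service provider. This is not the plugin mentioned above and not Nutch API; the class and function names (`SamlFormParser`, `extract_saml_post`) are hypothetical, and it assumes the form is plain HTML rather than built by JavaScript.

```python
from html.parser import HTMLParser


class SamlFormParser(HTMLParser):
    """Collects the action URL and named input fields of the first form with an action."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}
        self._in_form = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
            self._in_form = True
        elif tag == "input" and self._in_form:
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def extract_saml_post(html):
    """Return (action_url, fields) if the page carries a SAMLResponse form, else None."""
    parser = SamlFormParser()
    parser.feed(html)
    if parser.action and "SAMLResponse" in parser.fields:
        return parser.action, parser.fields
    return None
```

A plain HTTP client would have to post those fields to the action URL itself and keep the resulting session cookies for later requests. A Selenium-driven browser does all of that on its own, which is presumably why extending the selenium-based protocol plugin is the easier route.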