
How to crawl a website that has SAML authentication using ManifoldCF or Nutch?

I am trying to crawl a website, specifically a Google Site protected by SAML authentication, using ManifoldCF, and to index the crawled data into Apache Solr. When I crawl the URL, I get a 302 redirect to the login page, and ManifoldCF then reports RESPONSECODENOTINDEXABLE.

I am not sure whether I have authenticated correctly. ManifoldCF offers HTTP basic authentication, NTLM authentication, and session-based access credentials. I used the session-based method, which looks more like form-based authentication than SAML authentication.

Has anybody crawled a website with SAML authentication using ManifoldCF? If not ManifoldCF, has anyone accomplished this with Apache Nutch? As far as I can tell, it also provides only HTTP basic, Digest, and NTLM authentication.

Any insight would be helpful. I can provide more information about the issue if needed. Basically, when I crawl https://sites.google.com/a/my-sub-domain.com, it redirects to the SSO login page and the crawler stops there, reporting the 302 response. It is an intranet website.

Saurabh Chaturvedi asked Aug 08 '16 14:08



1 Answer

Nutch has no built-in support for SSO authentication using SAML. You need to handle it by writing a custom plugin. We extended the protocol-selenium plugin to handle SAML flows.
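
Before writing a plugin, it can help to confirm that the 302 really is a SAML hop and not an ordinary form login. The sketch below is not part of Nutch or ManifoldCF; `is_saml_redirect` is a hypothetical helper, and it assumes the identity provider uses the SAML HTTP-Redirect binding, where the request travels in a `SAMLRequest` query parameter on the redirect URL. The IdP URL in the usage note is made up.

```python
from urllib.parse import urlparse, parse_qs


def is_saml_redirect(status_code, location):
    """Return True if a response looks like a SAML HTTP-Redirect binding hop.

    status_code: HTTP status of the response the crawler received.
    location:    value of the Location header, or None if absent.

    Assumption: the IdP uses the HTTP-Redirect binding. An IdP that uses the
    HTTP-POST binding answers with a 200 page containing an auto-submitting
    form instead, which this check does not detect.
    """
    # The HTTP-Redirect binding is delivered as a 302 or 303 response.
    if status_code not in (302, 303) or not location:
        return False
    # Look for the SAMLRequest parameter in the redirect target's query string.
    query = parse_qs(urlparse(location).query)
    return "SAMLRequest" in query
```

For example, `is_saml_redirect(302, "https://idp.example.com/sso?SAMLRequest=abc&RelayState=xyz")` returns `True`, while a redirect to a plain `/login` page returns `False`. If the check is true, session-based form credentials are unlikely to be enough, since the login spans several redirects and a signed response posted back to the site, which is why a browser-driven approach such as Selenium is a natural fit.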

user1264641 answered Oct 05 '22 05:10
