
Nutch does not crawl all links in form

I have a problem crawling my site. There is a form with two drop-down lists, and when I start the crawl, the crawler fetches only some of the links from the form: it picks up only part of the options from the first drop-down list, and likewise from the second. I tried changing some settings in the nutch-default.xml file, but the result is the same.

I changed the following properties (old value -> new value):

fetcher.threads.per.queue: 1 -> 10
db.ignore.internal.links: true -> false
db.ignore.external.links: false -> true
http.content.limit: 65536 -> 65536000
file.content.limit: 65536 -> 65536000
db.update.max.inlinks: 10000 -> 100000

Is there any other option that can help me crawl all the options in my form? Thanks for any answers.

Hayk Grigoryan asked Nov 12 '22 20:11



1 Answer

Sorry, my rep is too low to post a comment.

Do you have a link to the page?

Also, are the drop-downs populated via AJAX or something similarly fancy? From memory, Nutch will only crawl what is in the page itself. For example, if you load the first 10 options on page load and only load the rest from a service when the user scrolls, I believe it can't find those.
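One rough way to check this is to look at the raw HTML the server returns and list the options that are actually in it. Below is a minimal sketch in Python using only the standard library. It is not Nutch's own parser, only a stand-in for any fetcher that does not run JavaScript. The sample page, the `region` select, and the `loadMoreOptions` script call are all made up for illustration.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collect the value attribute of every <option> present in static HTML."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # Only options written into the markup itself are ever seen here.
        if tag == "option":
            value = dict(attrs).get("value")
            if value is not None:
                self.values.append(value)


def static_option_values(html):
    """Return the option values a non-JavaScript fetcher could discover."""
    parser = OptionCollector()
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical page: two options are in the markup; the script would add
# more in a browser, but a fetcher that does not execute it never sees them.
SAMPLE_HTML = """
<form action="/search" method="get">
  <select name="region" id="region">
    <option value="north">North</option>
    <option value="south">South</option>
  </select>
</form>
<script>
  loadMoreOptions("region");
</script>
"""

print(static_option_values(SAMPLE_HTML))
```

If you save your real page with something like curl and run it through `static_option_values`, then compare the result with what the browser shows in the drop-downs, any options missing from the list are probably being added by script after page load.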

Some more info about the page would be good.

Cheers, Robin

Robin Rieger answered Dec 12 '22 02:12
