
How to crawl entire Wikipedia?

I've tried the WebSphinx application.

I've noticed that if I use wikipedia.org as the starting URL, it doesn't crawl any further.

So how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to find specific URLs myself and supply multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

Mr CooL asked Feb 22 '10 20:02



2 Answers

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.
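To illustrate why a dump can stand in for a crawl, here is a minimal Python sketch that streams pages out of a MediaWiki XML export. It assumes the dump is a bz2-compressed XML file containing `<page>` elements with a `<title>` and a `<text>` child; the names `iter_pages`, `iter_dump_file` and `SAMPLE`, and the namespace version in the sample, are hypothetical and only for illustration.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> in a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children to save memory


def iter_dump_file(path):
    """Open a bz2-compressed dump (path is an assumption) and stream its pages."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)


# Tiny hand-written sample in the same shape as an export file (not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>First article</text></revision></page>
  <page><title>Beta</title><revision><text>Second article</text></revision></page>
</mediawiki>"""

if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(page_text))
```

Because `iterparse` reads the file incrementally, this approach should not need the whole dump in memory, which matters for a file of that size.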

Andrew answered Oct 11 '22 17:10


I'm not sure, but maybe WebSphinx's User-Agent is blocked by Wikipedia's robots.txt:

http://en.wikipedia.org/robots.txt
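One way to test this guess is to run a crawler's User-Agent string against the robots.txt rules. The Python sketch below uses the standard library's `urllib.robotparser`. The rules in `SAMPLE_RULES` and the agent names `ExampleBot` and `OtherBot` are made up for illustration; they are not Wikipedia's actual rules, and I don't know which User-Agent string WebSphinx really sends.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: one agent banned outright, everyone else kept out of /w/.
SAMPLE_RULES = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /w/
"""

if __name__ == "__main__":
    page = "https://en.wikipedia.org/wiki/Web_crawler"
    print(is_allowed(SAMPLE_RULES, "ExampleBot", page))
    print(is_allowed(SAMPLE_RULES, "OtherBot", page))
```

To check the live file instead, `RobotFileParser` can fetch it with `set_url(...)` followed by `read()`. If the crawler's agent turns out to be disallowed for `/`, that would explain why it stops at the start URL.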

Dr.Optix answered Oct 11 '22 18:10
