How to crawl the entire Wikipedia?

Question

I've tried the WebSphinx application.

I noticed that if I use wikipedia.org as the starting URL, it does not crawl any further.

So how do I actually crawl the entire Wikipedia? Can anyone give me some guidelines? Do I need to find the URLs myself and supply multiple starting URLs?

Does anyone know of a good website with a tutorial on using WebSphinx's API?

Andrew · Accepted Answer

If your goal is to get all of Wikipedia's content, you might want to look at the available database dumps rather than crawling the live site. See http://download.wikimedia.org/.
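To illustrate working from a dump rather than crawling, here is a minimal sketch that streams page titles out of a MediaWiki XML export with StAX. It assumes the dump has already been downloaded and decompressed (the JDK has no bzip2 decoder, so a third-party library would be needed for `.bz2` files), and the class and method names `DumpTitles` and `readTitles` are made up for this example. It relies only on the export format wrapping each article in a `<page>` element with a `<title>` child.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of WebSphinx or any Wikimedia tool.
class DumpTitles {

    // Streams through the XML and collects the text of every <title> element.
    // StAX reads the input incrementally, so the XML itself is never fully
    // loaded into memory. For a full dump, process each title as it is read
    // instead of collecting them all in a list.
    public static List<String> readTitles(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a real dump file; replace with a FileInputStream.
        String sample = "<mediawiki>"
                + "<page><title>Java</title><revision><text>...</text></revision></page>"
                + "<page><title>Web crawler</title><revision><text>...</text></revision></page>"
                + "</mediawiki>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String title : readTitles(in)) {
            System.out.println(title);
        }
    }
}
```

The same loop can be extended to read the `<text>` element of each revision if you need the article bodies and not just the titles.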

Dr.Optix · Answer

I'm not sure, but maybe WebSphinx's User-Agent is blocked by Wikipedia's robots.txt:

http://en.wikipedia.org/robots.txt
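One way to check this guess is to download that file and test your crawler's User-Agent against it. Below is a rough sketch of such a check. It is deliberately simplified: it only understands `User-agent` and `Disallow` lines, ignores `Allow` rules and wildcards, and treats the `*` group as applying alongside any agent-specific group. The names `RobotsCheck`, `isDisallowed`, `BadBot` and `GoodBot` are invented for the example, and the sample rules are not Wikipedia's actual rules.

```java
import java.util.Locale;

// Hypothetical helper class; a simplified reading of robots.txt, not a full parser.
class RobotsCheck {

    // Returns true if any Disallow rule in a group matching userAgent
    // (or in the "*" group) is a prefix of the given path.
    public static boolean isDisallowed(String robotsTxt, String userAgent, String path) {
        String ua = userAgent.toLowerCase(Locale.ROOT);
        boolean groupApplies = false; // does the current group name our agent?
        boolean inAgentLines = false; // still reading consecutive User-agent lines?
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceFirst("#.*$", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                if (!inAgentLines) {
                    groupApplies = false; // a new group starts here
                }
                inAgentLines = true;
                if (value.equals("*")
                        || (!value.isEmpty() && ua.contains(value.toLowerCase(Locale.ROOT)))) {
                    groupApplies = true;
                }
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && groupApplies
                        && !value.isEmpty() && path.startsWith(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up rules for illustration only.
        String robots = "User-agent: BadBot\nDisallow: /\n\nUser-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowed(robots, "BadBot/1.0", "/wiki/Java"));
        System.out.println(isDisallowed(robots, "GoodBot", "/wiki/Java"));
    }
}
```

If the check shows your crawler's agent is disallowed, that would explain why the crawl stops at the start page, and it is another reason to prefer the dumps.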

Tags:

java

web-crawler

wikipedia

websphinx

Mr CooL
