How would I get a subset of Wikipedia's pages?

Question

How would I get a subset (say 100MB) of Wikipedia's pages? I've found that you can get the whole dataset as XML, but it's more like 1 or 2 gigs; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 megs worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Any that aren't torrents? I can't get those at work.

Alex · Accepted Answer

The Stack Overflow database is available for download.

Jim Ferrans · Answer

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
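To make that concrete, here is a minimal sketch in Python using only the standard library. The function names (`fetch_random_page`, `collect_pages`), the User-Agent string, and the one-second delay are my own assumptions for illustration, not anything Wikipedia prescribes; adjust them to your situation. It relies on `Special:Random` redirecting to an actual article, so the final URL after the redirect can serve as the duplicate key.

```python
import time
import urllib.request

# HTTPS form of the "Random Page" link mentioned above.
RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"


def fetch_random_page(url=RANDOM_URL):
    """Fetch one random article; return (final_url, body_bytes)."""
    # The User-Agent value is a made-up placeholder; use one that identifies you.
    req = urllib.request.Request(
        url, headers={"User-Agent": "sample-collector/0.1 (example)"}
    )
    with urllib.request.urlopen(req) as resp:
        # urlopen follows the redirect, so geturl() is the article's own URL.
        return resp.geturl(), resp.read()


def collect_pages(target_bytes, fetch=fetch_random_page,
                  delay_seconds=1.0, max_requests=None):
    """Collect unique pages until their combined size reaches target_bytes.

    `fetch` is any zero-argument callable returning (url, bytes), which
    makes it easy to swap in a fake for testing. `max_requests` is an
    optional safety cap (hypothetical parameter, not in the original answer).
    """
    pages = {}          # final URL -> page body
    total = 0           # bytes collected so far
    requests_made = 0
    while total < target_bytes:
        if max_requests is not None and requests_made >= max_requests:
            break
        final_url, body = fetch()
        requests_made += 1
        if final_url not in pages:      # discard duplicates
            pages[final_url] = body
            total += len(body)
        if delay_seconds:
            time.sleep(delay_seconds)   # crude rate limiting
    return pages
```

Calling `collect_pages(100 * 1024 * 1024)` would aim for roughly 100MB of raw HTML; you would still need to write the pages to disk and strip the markup if you only want the text. At one request per second this takes a while, so a smaller target is probably wise for a first run.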

Tags:

wiki

mapreduce

sample-data

Chris
