
How would I get a subset of Wikipedia's pages?

How would I get a subset (say 100 MB) of Wikipedia's pages? I've found you can get the whole dataset as XML, but it's more like 1 or 2 GB; I don't need that much.

I want to experiment with implementing a map-reduce algorithm.

Having said that, if I could just find 100 MB's worth of textual sample data from anywhere, that would also be good. E.g. the Stack Overflow database, if it's available, would possibly be a good size. I'm open to suggestions.

Edit: Are there any sources that aren't torrents? I can't download those at work.

asked by Chris

2 Answers

The Stack Overflow database is available for download.

answered by Alex

Chris, you could just write a small program to hit the Wikipedia "Random Page" link until you get 100MB of web pages: http://en.wikipedia.org/wiki/Special:Random. You'll want to discard any duplicates you might get, and you might also want to limit the number of requests you make per minute (though some fraction of the articles will be served up by intermediate web caches, not Wikipedia servers). But it should be pretty easy.
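A minimal sketch of that idea in Python, assuming the third-party `requests` library; the 100 MB target, the one-request-per-second delay, and the `wiki_sample` output directory are arbitrary choices, not anything Wikipedia prescribes:

```python
import hashlib
import os
import time

import requests

RANDOM_URL = "https://en.wikipedia.org/wiki/Special:Random"
TARGET_BYTES = 100 * 1024 * 1024  # stop once ~100 MB of raw HTML is saved
OUT_DIR = "wiki_sample"           # hypothetical output directory

os.makedirs(OUT_DIR, exist_ok=True)
seen = set()   # final article URLs already saved, to discard duplicates
total = 0

while total < TARGET_BYTES:
    # Special:Random redirects to a random article; requests follows the redirect,
    # so resp.url is the URL of the article actually served.
    resp = requests.get(RANDOM_URL, timeout=30)
    article_url = resp.url
    if article_url in seen:
        time.sleep(1)
        continue
    seen.add(article_url)

    body = resp.content
    name = hashlib.md5(article_url.encode()).hexdigest() + ".html"
    with open(os.path.join(OUT_DIR, name), "wb") as f:
        f.write(body)

    total += len(body)
    time.sleep(1)  # be polite: roughly one request per second
```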

answered by Jim Ferrans