
Wikipedia text download

I am looking to download the full text of Wikipedia for my college project. Do I have to write my own spider to download it, or is there a public dataset of Wikipedia available online?

To give you an overview of my project: I want to find the interesting words in a few articles I am interested in. To find them, I plan to apply tf-idf, scoring each word and picking the ones with the highest scores. But to compute the idf part, I need to know in how many articles each word occurs across the whole of Wikipedia.
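For concreteness, here is a minimal sketch of the tf-idf scoring described above, in Python. The toy corpus, the function name `tf_idf`, and the smoothed idf formula are illustrative assumptions, not something the project prescribes; on real data the document frequencies would come from the full Wikipedia text.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, doc_freq, n_docs):
    """Score each word of one document; a higher score means more distinctive."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / total                        # term frequency within this document
        df = doc_freq.get(word, 0)                # corpus documents containing the word
        idf = math.log((1 + n_docs) / (1 + df))   # smoothed inverse document frequency (one common variant)
        scores[word] = tf * idf
    return scores


# Toy corpus standing in for Wikipedia (hypothetical data).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the quantum computer uses qubits".split(),
]

# Document frequency: in how many documents each word appears at least once.
doc_freq = Counter(word for doc in corpus for word in set(doc))

scores = tf_idf(corpus[2], doc_freq, len(corpus))
for word in sorted(scores, key=scores.get, reverse=True):
    print(f"{word}: {scores[word]:.3f}")
```

A word such as "the", which appears in every document, gets an idf of zero under this formula, which is why corpus-wide counts are needed and not only counts from the articles of interest.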

How can this be done?

Boolean asked Apr 21 '10 13:04



1 Answer

From Wikipedia: http://en.wikipedia.org/wiki/Wikipedia_database

Wikipedia offers free copies of all available content to interested users. These databases can be used for mirroring, personal use, informal backups, offline use or database queries (such as for Wikipedia:Maintenance). All text content is multi-licensed under the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). Images and other files are available under different terms, as detailed on their description pages. For our advice about complying with these licenses, see Wikipedia:Copyrights.

It seems you are in luck, too. From the dump section:

As of 12 March 2010, the latest complete dump of the English-language Wikipedia can be found at http://download.wikimedia.org/enwiki/20100130/ This is the first complete dump of the English-language Wikipedia to have been created since 2008. Please note that more recent dumps (such as the 20100312 dump) are incomplete.

So the data is only 9 days old :)
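Once you have a dump, you probably do not want to load it into memory in one go. Below is a rough sketch, in Python, of streaming the pages out of the XML and counting in how many pages each word appears. I am assuming the usual `<page>`/`<revision>`/`<text>` layout of the MediaWiki export format; the file name in the comment, the helper names, and the tiny in-memory sample are made up for illustration. The `<text>` content is raw wiki markup, so a real project would also want to strip templates and links before counting.

```python
import bz2
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

WORD_RE = re.compile(r"[a-z]+")


def iter_page_texts(xml_stream):
    """Yield the wikitext of each <page>, parsing the stream incrementally."""
    text = ""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]   # drop the XML namespace prefix, if any
        if tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield text
            text = ""
            elem.clear()                    # drop the finished page's subtree to keep memory use low


def document_frequencies(texts):
    """Count, for every word, the number of pages it occurs in."""
    doc_freq = Counter()
    n_docs = 0
    for text in texts:
        n_docs += 1
        doc_freq.update(set(WORD_RE.findall(text.lower())))
    return doc_freq, n_docs


# Tiny in-memory sample shaped like an export file (hypothetical content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>A</title><revision><text>The cat sat.</text></revision></page>
  <page><title>B</title><revision><text>The dog ran.</text></revision></page>
</mediawiki>"""

doc_freq, n_docs = document_frequencies(iter_page_texts(io.BytesIO(SAMPLE)))
print(n_docs, doc_freq["the"], doc_freq["cat"])  # prints: 2 2 1

# For a real dump, something along these lines should work (file name assumed):
# with bz2.open("enwiki-20100130-pages-articles.xml.bz2", "rb") as f:
#     doc_freq, n_docs = document_frequencies(iter_page_texts(f))
```

The resulting counts and page total are exactly what the idf part of tf-idf needs.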

Sam Holder answered Sep 17 '22 14:09