
Finding and downloading images within the Wikipedia Dump

I'm trying to find a comprehensive list of all images on wikipedia, which I can then filter down to the public domain ones. I've downloaded the SQL dumps from here:

http://dumps.wikimedia.org/enwiki/latest/

And studied the DB schema:

http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png

I think I understand it but when I pick a sample image from a wikipedia page I can't find it anywhere in the dumps. For example:

http://en.wikipedia.org/wiki/File:Carrizo_2a.JPG

I've done a grep on the dumps 'image', 'imagelinks', and 'page' looking for 'Carrizo_2a.JPG' and it's not found.

Are these dumps not complete? Am I misunderstanding the structure? Is there a better way to do this?

Also, to jump ahead one step: after I have filtered my list down and I want to download a bulk set of images (thousands), I saw some mentions that I need to do this from a mirror of the site to prevent overloading wikipedia/wikimedia. If anyone has any guidance on this too, that would be helpful.

Keith Schacht asked Apr 05 '13 21:04


1 Answer

MediaWiki stores file data in two or three places, depending on how you count:

  • The actual metadata for current file versions is stored in the image table. This is probably what you primarily want; you'll find the latest en.wikipedia dump of it here.

  • Data for old, superseded file revisions is moved to the oldimage table, which has basically the same structure as the image table. This table is also dumped; the latest one is here.

  • Finally, each file also (normally) corresponds to a pretty much ordinary wiki page in namespace 6 (File:). You'll find the text of these in the XML dumps, same as for any other pages.

Oh, and the reason you're not finding those files you linked to in the English Wikipedia dumps is that they're from the shared repository at Wikimedia Commons. You'll find them in the Commons data dumps instead.
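To check which dump actually mentions a given file, a rough Python equivalent of the grep from the question might look like the sketch below. The function name `dump_mentions` is made up for illustration, and it assumes the dump is a gzipped SQL file in which titles use underscores instead of spaces. It is a plain substring search, not a real SQL parser, so names containing characters that get escaped in SQL (such as apostrophes) may be missed.

```python
import gzip


def dump_mentions(dump_path, file_name):
    """Return True if file_name occurs anywhere in a gzipped SQL dump.

    Hypothetical helper: a plain byte-level substring search, like grep.
    """
    # Assumption: titles in the dump use underscores instead of spaces.
    needle = file_name.replace(" ", "_").encode("utf-8")
    with gzip.open(dump_path, "rb") as dump:
        # Read line by line so the whole dump never has to fit in memory.
        for line in dump:
            if needle in line:
                return True
    return False
```

Running this against both the en.wikipedia and the Commons image table dumps should show which repository a file lives in.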

As for downloading the actual files, here's the (apparently) official documentation. As far as I can tell, all they mean by "Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers." is that if you want all the images in a tarball, you'll have to use a mirror. If you're only pulling a relatively small subset of the millions of images on Wikipedia and/or Commons, it should be fine to use the Wikimedia servers directly.

Just remember to exercise basic courtesy: send a user-agent string identifying yourself and don't hit the servers too hard. In particular, I'd recommend running the downloads sequentially, so that you only start downloading the next file after you've finished the previous one. Not only is that easier to implement than parallel downloading anyway, but it ensures that you don't hog more than your share of the bandwidth and allows the download speed to more or less automatically adapt to server load.
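A minimal sketch of that kind of polite, sequential downloader, using only the Python standard library, could look like this. The user-agent string, the one-second pause, and the names `build_request` and `download_sequentially` are all placeholders of mine, not anything prescribed by Wikimedia; put your own contact details in the user-agent.

```python
import time
import urllib.request

# Hypothetical user-agent: replace with something that identifies you.
USER_AGENT = "ExampleImageFetcher/0.1 (contact: you@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Build a request that carries an identifying User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_sequentially(urls, save, fetch=None, pause=1.0):
    """Fetch urls one at a time and hand each result to save(url, data).

    The next download only starts after the previous one has finished.
    pause is an arbitrary extra delay (in seconds) between downloads.
    """
    if fetch is None:
        def fetch(request):
            with urllib.request.urlopen(request) as response:
                return response.read()
    for index, url in enumerate(urls):
        if index:
            time.sleep(pause)
        save(url, fetch(build_request(url)))
```

The `fetch` parameter is only there so the loop can be tried out without touching the network; in normal use you would leave it out and pass a `save` function that writes the bytes to disk.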

P.S. Whether you download the files from a mirror or directly from the Wikimedia servers, you're going to need to figure out which directory they're in. Typical Wikipedia file URLs look like this:

http://upload.wikimedia.org/wikipedia/en/a/ab/File_name.jpg

where the "wikipedia/en" part identifies the Wikimedia project and language (for historical reasons, Commons is listed as "wikipedia/commons"), and the "a/ab" part is derived from the MD5 hash of the filename in UTF-8 (with the name encoded as it is in the database dumps): "a" is the first hex digit of the hash and "ab" is the first two.
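In Python, building such a URL could be sketched roughly as follows. The function name `upload_url` and its `project` parameter are my own; the code assumes the name is passed in the same form the database dumps use, apart from spaces, which it turns into underscores, and it percent-encodes the name for use in the URL.

```python
import hashlib
from urllib.parse import quote


def upload_url(file_name, project="wikipedia/en"):
    """Build the upload.wikimedia.org URL for a file name (hypothetical helper)."""
    # Assumption: the dumps store names with underscores instead of spaces.
    name = file_name.replace(" ", "_")
    # Hash the UTF-8 bytes of the name and take the leading hex digits.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "http://upload.wikimedia.org/{}/{}/{}/{}".format(
        project, digest[0], digest[:2], quote(name)
    )
```

For a file hosted on Commons, you would call it as `upload_url("Carrizo_2a.JPG", project="wikipedia/commons")`.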

Ilmari Karonen answered Sep 28 '22 12:09