
Wikipedia Category Hierarchy from dumps

Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.

For example, the category SQL dump has an entry for each category, but I can't find anything about how the categories relate to each other.

The other dump (latest-pages-articles) lists the parent categories of each page, but only as a flat, unordered list. It states all the parents without saying how they are nested.

I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?

fersarr asked Jul 02 '13 17:07


2 Answers

The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.

You're also going to need the page (not pages-articles) dump for page id to title mapping.
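
As a rough illustration of how the two dumps fit together, here is a minimal sketch. It assumes the rows have already been parsed out of the SQL dumps into tuples and that the dumps use the classic MediaWiki schema: `page_id`, `page_namespace`, `page_title` in the page table, and `cl_from`, `cl_to`, `cl_type` in categorylinks (newer dumps may differ). The function name `build_hierarchy` and the toy rows are made up for the example.

```python
from collections import defaultdict

CATEGORY_NS = 14  # MediaWiki namespace number for "Category:" pages


def build_hierarchy(page_rows, categorylinks_rows):
    """Return (id_to_title, children) for category pages.

    page_rows: iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to, cl_type)
    Assumes cl_from is the member's page id and cl_to is the parent
    category's title (without the "Category:" prefix).
    """
    # Page id -> title, restricted to category pages.
    id_to_title = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NS
    }

    # Parent category title -> set of child category titles.
    children = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        if cl_type != "subcat":
            continue  # skip article and file memberships
        child = id_to_title.get(cl_from)
        if child is not None:
            children[cl_to].add(child)
    return id_to_title, dict(children)


# Hypothetical toy rows standing in for the parsed dumps.
page_rows = [
    (1, 14, "Science"),
    (2, 14, "Physics"),
    (3, 14, "Optics"),
    (4, 0, "Photon"),  # an article, not a category
]
categorylinks_rows = [
    (2, "Science", "subcat"),
    (3, "Physics", "subcat"),
    (4, "Physics", "page"),
]

id_to_title, children = build_hierarchy(page_rows, categorylinks_rows)
```

The `id_to_title` mapping is also what answers the "name for each ID" part of the question. Keep in mind that the result is a graph rather than a strict tree: a category can have several parents, and cycles can occur.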

svick answered Oct 23 '22 05:10


Loading the categorylinks dump and the related tables to build a Wikipedia hierarchy takes a very long time (even if it is interesting).

I found a faster path that gives good results: I rely on Wikipedia's vital articles hierarchy. See, for instance, sensimark for an example of its use.

amirouche answered Oct 23 '22 06:10