Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.
For example, the category SQL dump has an entry for each category, but I can't find anything about how the categories relate to each other.
The other dump (pages-articles) lists the parent categories of each page, but in an unordered way: it just states all the parents, without any hierarchy.
I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?
A dump of all of Wikipedia is available through Wikimedia at dumps.wikimedia.org. (A dump is a periodic snapshot of a database.) The English version is at dumps.wikimedia.org/enwiki.
At the bottom of an article, you will see a box containing the categories to which that article has been assigned. Simply click any of these categories to go to the corresponding category page.
The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.
You're also going to need the page (not pages-articles) dump for the page ID to title mapping.
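To make the join between the two tables concrete, here is a minimal Python sketch. It assumes the rows have already been extracted from the SQL dumps into tuples, and that categorylinks uses the classic column layout (cl_from = ID of the member page, cl_to = title of the parent category, cl_type = 'page', 'subcat' or 'file'); check the schema of the dump you actually downloaded, since newer dumps may differ. The function name and the sample rows are made up for illustration.

```python
from collections import defaultdict

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:" pages


def build_category_hierarchy(page_rows, categorylinks_rows):
    """Map each parent category title to a sorted list of its subcategory titles.

    page_rows:          iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to, cl_type)
    Both layouts are assumptions about how the dump rows were extracted.
    """
    # Page ID -> title, restricted to category pages.
    category_title_by_id = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    children = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        # Only 'subcat' rows link a category to its parent category.
        if cl_type != "subcat":
            continue
        child_title = category_title_by_id.get(cl_from)
        if child_title is not None:
            children[cl_to].add(child_title)

    return {parent: sorted(kids) for parent, kids in children.items()}


# Hypothetical sample rows, not taken from a real dump.
page_rows = [
    (1, 14, "Science"),
    (2, 14, "Physics"),
    (3, 14, "Biology"),
    (4, 0, "Albert_Einstein"),
]
categorylinks_rows = [
    (2, "Science", "subcat"),
    (3, "Science", "subcat"),
    (4, "Physics", "page"),
]

hierarchy = build_category_hierarchy(page_rows, categorylinks_rows)
print(hierarchy)
```

The same page ID to title dictionary is what lets you turn numeric category IDs back into names. For the full English dump the tables are large, so loading them into a database and joining there may be more practical than holding everything in memory.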
Loading the categorylinks dump and the related tables to build a Wikipedia hierarchy takes a very long time (even if it is interesting).
I found a faster path that gives good results: I rely on Wikipedia's vital articles hierarchy. See sensimark for an example of its use.