I'm trying to build a tree graph of Wikipedia articles and their categories. What do I need in order to do that?
From this site (http://dumps.wikimedia.org/enwiki/latest/), I've downloaded:
I tried following the answer here (Wikipedia Category Hierarchy from dumps), but the categorylinks table doesn't seem to have the same schema (there is no pageId column).
What's the right way to build the hierarchy?
Bonus question: How can I tell which of the 35M pages in enwiki-latest-page.sql.gz are articles (supposedly about 5M, according to Wikipedia's statistics)?
Thanks
Yes, it turns out this Stack Overflow answer was right. It referenced the right datasets, but I was too dense to understand how to relate them to one another.
Thanks to @svick for leading me through the individual steps in a private chat.
For the benefit of others, I've detailed the relationship between the datasets and the exact steps to traverse the graph on my blog; the post is a summary of our private chat.
Parsing Wikipedia Page Hierarchy
I ran into the same problem with the Japanese Wikipedia.
I solved it as follows:
    MariaDB [wikipedia]> select page.page_title
        from categorylinks
        join page on page.page_id = categorylinks.cl_from
        join category on categorylinks.cl_to = category.cat_title
        where categorylinks.cl_type = 'subcat'
        and category.cat_title like '学問';
    +-----------------------------------+
    | page_title                        |
    +-----------------------------------+
    | 学問の分野                        |
    | 科学                              |
    | 学問スタブ                        |
    | 架空の思想・学問                  |
    | 学者                              |
    | 学術出版                          |
    | 学術称号                          |
    | 学術団体                          |
    | 学生                              |
    | 学派                              |
    | 学問の賞                          |
    | 研究                              |
    | 高等教育                          |
    | 知識                              |
    | 問題                              |
    | ルネサンス・ユマニスム            |
    +-----------------------------------+
    16 rows in set (0.00 sec)
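The query above returns one level of subcategories. To get the whole hierarchy, the same join can be repeated level by level. Below is a minimal Python sketch of that idea, not a definitive implementation. It assumes the categorylinks layout used in the query above (cl_from, cl_to, cl_type) and runs against a tiny in-memory SQLite stand-in for the dump; the demo rows and the helper names build_demo_db and walk_category are made up for illustration. It also touches on the bonus question: articles are the rows of page with page_namespace = 0 and page_is_redirect = 0, while category pages live in namespace 14.

```python
import sqlite3
from collections import deque

CATEGORY_NS = 14  # namespace number of "Category:" pages
ARTICLE_NS = 0    # main namespace, where articles live


def build_demo_db():
    """Create a tiny in-memory stand-in for the page and categorylinks tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT,
            page_is_redirect INTEGER
        );
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_type TEXT);
        """
    )
    # Hypothetical demo rows, not taken from any real dump.
    conn.executemany(
        "INSERT INTO page VALUES (?, ?, ?, ?)",
        [
            (1, CATEGORY_NS, "Science", 0),
            (2, CATEGORY_NS, "Physics", 0),
            (3, CATEGORY_NS, "Optics", 0),
            (4, ARTICLE_NS, "Photon", 0),
            (5, ARTICLE_NS, "Light_particle", 1),  # a redirect, not an article
            (6, ARTICLE_NS, "Scientist", 0),
        ],
    )
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?, ?)",
        [
            (2, "Science", "subcat"),
            (3, "Physics", "subcat"),
            (1, "Optics", "subcat"),  # deliberate cycle: Science sits under Optics
            (4, "Optics", "page"),
            (5, "Optics", "page"),
            (6, "Science", "page"),
        ],
    )
    return conn


def walk_category(conn, root):
    """Breadth-first walk from root; maps each category to its subcats and articles."""
    query = """
        SELECT p.page_title, cl.cl_type
        FROM categorylinks AS cl
        JOIN page AS p ON p.page_id = cl.cl_from
        WHERE cl.cl_to = ?
          AND ((cl.cl_type = 'subcat' AND p.page_namespace = ?)
            OR (cl.cl_type = 'page' AND p.page_namespace = ?
                AND p.page_is_redirect = 0))
        ORDER BY p.page_title
    """
    tree = {}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        if cat in tree:  # already expanded; this also breaks cycles
            continue
        node = {"subcats": [], "articles": []}
        tree[cat] = node
        for title, cl_type in conn.execute(query, (cat, CATEGORY_NS, ARTICLE_NS)):
            if cl_type == "subcat":
                node["subcats"].append(title)
                queue.append(title)
            else:
                node["articles"].append(title)
    return tree


if __name__ == "__main__":
    demo = build_demo_db()
    for category, members in walk_category(demo, "Science").items():
        print(category, members)
```

The check on already-expanded categories matters because the category graph is not a strict tree: a category can have several parents, and cycles do occur. On the real dump, one query per category is slow, so it is probably better to load categorylinks into memory once or to index cl_to before walking.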