
Is there a way to extract Wiktionary data without scraping?

I know there's DBPedia for Wikipedia, but does something like that exist for Wiktionary? I'd like to get something like https://en.wiktionary.org/wiki/Category:en:Occupations into JSON or similar format.

Jonathan asked Sep 12 '25 16:09

2 Answers

Another way to go would be to load the Wiktionary category SQL dump into MySQL from the Wikimedia data dumps, e.g. enwiktionary-20190901-category.sql.gz.

Then use https://en.wiktionary.org/api/rest_v1/ to retrieve (and parse!) the HTML for the info you need.
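A minimal sketch of the REST half of this approach, using only the standard library: `page_html_url` builds the URL for the rest_v1 rendered-HTML endpoint, and a small `HTMLParser` subclass pulls the text of `<li>` elements (definitions on Wiktionary pages are list items). The function and class names are my own, and the actual network fetch is left to you; here the parser is exercised on an inline sample snippet.

```python
from html.parser import HTMLParser
from urllib.parse import quote

REST_BASE = "https://en.wiktionary.org/api/rest_v1"

def page_html_url(title: str) -> str:
    """URL of the rendered-HTML endpoint for a page title."""
    return f"{REST_BASE}/page/html/{quote(title, safe='')}"

class ListItemExtractor(HTMLParser):
    """Collect the plain text of top-level <li> elements."""
    def __init__(self):
        super().__init__()
        self.items, self._depth, self._buf = [], 0, []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "li" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

# Sample stand-in for the HTML you would fetch from page_html_url(...).
sample = "<ol><li>A person's <b>occupation</b>.</li><li>An activity.</li></ol>"
parser = ListItemExtractor()
parser.feed(sample)
print(page_html_url("baker"))
print(parser.items)
```

Real pages are much messier than the sample (nested lists, usage labels, quotations), so expect to refine the parsing for your use case.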

Good luck!

amirouche answered Sep 15 '25 05:09


There is DBpedia for Wikipedia, and there is DBnary for Wiktionary. See http://kaiko.getalp.org/about-dbnary

TL;DR: DBnary extracts 25 language editions of Wiktionary and produces an RDF dataset (using the OntoLex ontology) that can be imported into a quad store and queried. A new version is released twice a month.
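As a hedged sketch of what querying the dataset looks like, the snippet below builds a SPARQL request for English lexical entries, assuming the public endpoint DBnary hosts at http://kaiko.getalp.org/sparql and the standard OntoLex vocabulary. Only the request is constructed here (so it runs offline); pass it to `urllib.request.urlopen` to actually execute the query.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://kaiko.getalp.org/sparql"  # assumed public DBnary endpoint

# Ten English lexical entries with their written forms (OntoLex model).
QUERY = """
PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?entry ?form WHERE {
  ?entry a ontolex:LexicalEntry ;
         ontolex:canonicalForm/ontolex:writtenRep ?form .
  FILTER (lang(?form) = "en")
} LIMIT 10
"""

def build_request(query: str) -> Request:
    """POST request asking for SPARQL results as JSON."""
    return Request(
        ENDPOINT,
        data=urlencode({"query": query}).encode(),
        headers={"Accept": "application/sparql-results+json"},
    )

req = build_request(QUERY)
print(req.full_url, req.get_method())
```

The same query works against a local quad store after importing a DBnary dump, which avoids rate limits on the shared endpoint.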

One drawback: not all data is extracted and modeled. You can file a feature request at the DBnary extractor GitLab: https://gitlab.com/gilles.serasset/dbnary

The categories are usually not extracted, as they come from template processing: extracting them would require transcluding every page of every edition for every dump, and transclusion is not cheap (especially when it involves Lua, as is the case for most pages in the English edition).

Note: I am the author of DBnary.

dodecaplex answered Sep 15 '25 05:09