Has anyone parsed Wiktionary? [closed]

I once downloaded a Wiktionary dump to gather words and definitions for Slavic languages, and I used ElementTree to go through the XML file that makes up the dump. I would avoid scraping or crawling the site and instead download the XML dump that Wikimedia provides for Wiktionary. Go to the Wikimedia downloads page, look for the English Wiktionary dumps (enwiktionary), and open the most recent one. You'll probably want the pages-articles.xml.bz2 file, which is just the article content, with no history or comments. Parse it with whatever XML processing library you prefer in Python; I personally prefer ElementTree. Good luck.
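A minimal sketch of that approach, assuming the dump follows the usual MediaWiki export layout (a `page` element containing a `title` and a `revision/text`). The helper names `iter_pages` and `iter_dump`, the sample XML, and the namespace version in it are made up for illustration; check them against the dump you actually download.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the page's children; real dumps are large


def iter_dump(path):
    """Stream pages straight from the compressed .xml.bz2 file."""
    with bz2.open(path, "rb") as f:
        yield from iter_pages(f)


# Tiny hand-written sample standing in for the real dump (illustrative only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>voda</title>
    <revision><text>==Czech==\n===Noun===\n# water</text></revision>
  </page>
  <page>
    <title>chleb</title>
    <revision><text>==Polish==\n===Noun===\n# bread</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, wikitext in pages:
    print(page_title, "->", wikitext.splitlines()[0])
```

For the real file you would loop over `iter_dump(...)` with the path to your downloaded pages-articles.xml.bz2 instead of the in-memory sample. Using `iterparse` rather than `ET.parse` avoids loading the whole dump into memory at once.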

Wiktionary runs on MediaWiki, which has an API.

One of the subpages for the API documentation is Client code, which lists some Python libraries.
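If you would rather not pull in a client library, here is a rough sketch of a direct call using only the standard library. It assumes the standard MediaWiki `action=query` endpoint on en.wiktionary.org and the `formatversion=2` response shape as I understand it; the function names and the User-Agent string are hypothetical, and a missing page is not handled.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(title):
    """Build a URL asking for the current wikitext of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(title):
    """Fetch raw wikitext for `title` (makes a network request).

    Assumes the page exists; a missing page has no "revisions" key.
    """
    request = urllib.request.Request(
        build_query_url(title),
        # Placeholder User-Agent; replace with your own tool name and contact.
        headers={"User-Agent": "wiktionary-example/0.1 (hypothetical)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    page = data["query"]["pages"][0]
    return page["revisions"][0]["slots"]["main"]["content"]


url = build_query_url("water")
print(url)
```

Note that the API still hands back raw wikitext, so it saves you the download but not the parsing.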

Wordnik has done a good job of parsing out definitions and so on, and they have a great API.

As the others have mentioned, Wiktionary is a formatting disaster and was not built to be computer-readable.

Yes, many people have parsed Wiktionary. You can usually find accounts of past experiences in the Wiktionary-l mailing list archives.

A project not mentioned by other answers is DBpedia's Wiktionary RDF extraction.

Dozens of other research projects have parsed Wiktionary: you can find some examples in a recent Wiktionary special and in other issues of the Wikimedia research newsletter.

Recently someone also made an English Wiktionary REST API which includes an unspecified subset of the Wiktionary data; future plans for the thing are not known yet.

I had a crack at parsing the German Wiktionary. I ended up writing it off as too difficult, but I put my (not at all tidied up) code up at https://github.com/benreynwar/wiktionary-parser before I gave up. Although there are conventions used by the editors, they are not enforced by anything other than peer oversight. The diversity of templates used, along with all the typos in the pages, makes the parsing quite challenging.
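To give a feel for the template problem, here is a small sketch (not taken from the repository above) that lists the template names found in a piece of wikitext. The regex only handles simple, non-nested `{{name|args}}` templates, and the sample entry is hand-written for illustration, including a deliberate misspelling of one template name.

```python
import re
from collections import Counter

# Matches {{name}} or {{name|args}} as long as no template is nested inside.
TEMPLATE_RE = re.compile(r"\{\{([^{}|]+)(?:\|[^{}]*)?\}\}")


def template_names(wikitext):
    """Return the names of all simple templates, in order of appearance."""
    return [match.group(1).strip() for match in TEMPLATE_RE.finditer(wikitext)]


# Hand-written sample; "Bedeutung" stands in for a typo of "Bedeutungen".
SAMPLE = """== Haus ({{Sprache|Deutsch}}) ==
=== {{Wortart|Substantiv|Deutsch}} ===
{{Bedeutungen}}
:[1] Gebäude
{{Bedeutung}}
:[2] Familie
"""

counts = Counter(template_names(SAMPLE))
for name, count in counts.items():
    print(name, count)
```

Running a count like this over a whole dump is one way to see which template names are conventions and which are one-off variants or typos that a parser would have to special-case.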

I think the problem is that they've used the same system as for Wikipedia, which is great for ease of use by the editors but is not appropriate for the much more structured content of Wiktionary. It's a shame, because if Wiktionary could be easily parsed it would be a hugely useful resource.

Tags: python, dictionary, web-services, wiktionary
