Parse Wiktionary XML data dump into MySQL database using PHP

Tags: php, mysql, xml-parsing, wiktionary, data-dump

I'm trying to parse the Wiktionary data dump provided by Wikimedia.

My intention is to load that XML data dump into a MySQL database. I couldn't find proper documentation on the structure of this XML, and I'm not able to open the file because it is really huge (~1 GB).

I thought of parsing it with a PHP script, but I don't know enough about the XML structure to proceed. If anyone has already parsed it into MySQL using PHP (or knows of a tool that does), please share the details. If there is nothing in PHP, other methods are fine too.

I followed this post (http://www.igrec.ca/lexicography/installing-a-local-copy-of-wiktionary-mysql/) but it didn't work out. If anybody has succeeded with this process, please help. Thanks in advance.

asked Aug 08 '14 09:08

Jenson M John

1 Answer

Those files can be parsed in PHP with XMLReader operating on a compress.bzip2:// stream. As an example, the structure of the file you have looks like this (from peeking into roughly the first 3000 elements):

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (196)
    |-title (196)
    |-ns (196)
    |-id (196)
    |-restrictions (2)
    |-revision (196)
    | |-id (196)
    | |-parentid (194)
    | |-timestamp (196)
    | |-contributor (196)
    | | |-username (182)
    | | |-id (182)
    | | \-ip (14)
    | |-comment (183)
    | |-text (195)
    | |-sha1 (195)
    | |-model (195)
    | |-format (195)
    | \-minor (99)
    \-redirect (5)
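
A tree like the one above can be produced by streaming the dump and counting element paths. The following is only a minimal sketch of that idea, not the exact script used for the listing: the function name countElementPaths() and the file name in the usage note are assumptions, and the compress.bzip2:// wrapper needs PHP's bz2 extension.

```php
<?php
/**
 * Count how often each element path occurs in an XML stream.
 * Hypothetical helper for illustration.
 *
 * @param string   $uri   anything XMLReader::open() accepts,
 *                        e.g. a compress.bzip2:// URI (needs ext-bz2)
 * @param int|null $limit stop after this many elements (null = read all)
 * @return array<string,int> element path => number of occurrences
 */
function countElementPaths(string $uri, ?int $limit = null): array
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    $counts = [];
    $stack  = []; // names of the currently open ancestor elements
    $seen   = 0;

    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue; // skip text, whitespace, end tags, ...
        }
        // depth is 0 for the root element, so cut the stack back to the
        // ancestors of the current element before appending its name
        $stack   = array_slice($stack, 0, $reader->depth);
        $stack[] = $reader->name;

        $path = implode('/', $stack);
        $counts[$path] = ($counts[$path] ?? 0) + 1;

        if ($limit !== null && ++$seen >= $limit) {
            break;
        }
    }
    $reader->close();

    return $counts;
}
```

Called as, for instance, countElementPaths('compress.bzip2://enwiktionary-latest-pages-articles.xml.bz2', 3000) (assumed file name), it returns keys such as mediawiki/page/revision/text with their counts. Because XMLReader is a pull parser, memory use stays flat no matter how large the dump is.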

The complete file is considerably larger, so it takes quite some time to process. Alternatively, do not operate on the XML dumps at all and just import the SQL dumps with the mysql command-line tool. SQL dumps are available on the site as well; see all dump formats for the English Wiktionary:

http://dumps.wikimedia.org/enwiktionary/latest/

The overall file was a little larger, with more than 66 849 000 elements:

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (3993913)
    |-title (3993913)
    |-ns (3993913)
    |-id (3993913)
    |-restrictions (552)
    |-revision (3993913)
    | |-id (3993913)
    | |-parentid (3572237)
    | |-timestamp (3993913)
    | |-contributor (3993913)
    | | |-username (3982087)
    | | |-id (3982087)
    | | \-ip (11824)
    | |-comment (3917241)
    | |-text (3993913)
    | |-sha1 (3993913)
    | |-model (3993913)
    | |-format (3993913)
    | \-minor (3384811)
    |-redirect (27340)
    \-DiscussionThreading (4698)
      |-ThreadSubject (4698)
      |-ThreadPage (4698)
      |-ThreadID (4698)
      |-ThreadAuthor (4698)
      |-ThreadEditStatus (4698)
      |-ThreadType (4698)
      |-ThreadSignature (4698)
      |-ThreadParent (3605)
      |-ThreadAncestor (3605)
      \-ThreadSummaryPage (11)
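
Given this structure, getting pages into MySQL could look roughly like the sketch below. It is an illustration under assumptions, not a tested importer: readPages(), importPages() and the table wiktionary_page (columns id, ns, title, text) are hypothetical names, and the PDO connection is expected to be in exception error mode.

```php
<?php
/**
 * Stream <page> elements from a MediaWiki XML dump, one at a time.
 * Hypothetical helper; yields only a few of the fields listed above.
 */
function readPages(string $uri): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }
    try {
        // Move forward to the first <page> element.
        $found = false;
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->localName === 'page') {
                $found = true;
                break;
            }
        }
        $doc = new DOMDocument();
        while ($found) {
            // Expand only the current <page> subtree into SimpleXML.
            $page = simplexml_import_dom(
                $doc->importNode($reader->expand(), true)
            );
            yield [
                'id'    => (int) $page->id,
                'ns'    => (int) $page->ns,
                'title' => (string) $page->title,
                'text'  => (string) $page->revision->text,
            ];
            // Skip the rest of this subtree, go to the next <page>.
            $found = $reader->next('page');
        }
    } finally {
        $reader->close();
    }
}

/**
 * Insert pages with a prepared statement, committing in batches.
 * Table and column names are assumptions for this sketch.
 */
function importPages(PDO $pdo, iterable $pages, int $batchSize = 1000): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO wiktionary_page (id, ns, title, text) VALUES (?, ?, ?, ?)'
    );
    $count = 0;
    $pdo->beginTransaction();
    foreach ($pages as $p) {
        $stmt->execute([$p['id'], $p['ns'], $p['title'], $p['text']]);
        if (++$count % $batchSize === 0) {
            $pdo->commit();          // keep transactions reasonably small
            $pdo->beginTransaction();
        }
    }
    $pdo->commit();

    return $count;
}
```

Usage would be along the lines of importPages($pdo, readPages('compress.bzip2://' . $dumpFile)), where $pdo and $dumpFile are supplied by the caller. Keeping the reader and the database code separate means the parsing part can be tried out on a small sample file before any table exists.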

answered Nov 03 '22 00:11

hakre
