 

How can I parse a Wikipedia XML dump with Python?

I have:

import xml.etree.ElementTree as ET


def strip_tag_name(t):
    idx = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t


events = ("start", "end")

title = None
for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)

    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'page':
            print(title, elem.text)

This seems to give the title just fine, but the page text always seems blank. What am I missing?

I haven't been able to open the file (it's huge), but I think this is an accurate snippet:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.29.0-wmf.12</generator>
    <case>first-letter</case>
    <namespaces>
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>631144794</id>
      <parentid>381202555</parentid>
      <timestamp>2014-10-26T04:50:23Z</timestamp>
      <contributor>
        <username>Paine Ellsworth</username>
        <id>9092818</id>
      </contributor>
      <comment>add [[WP:RCAT|rcat]]s</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{Redr|move|from CamelCase|up}}</text>
      <sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    <revision>
      <id>766348469</id>
      <parentid>766047928</parentid>
      <timestamp>2017-02-19T18:08:07Z</timestamp>
      <contributor>
        <username>GreenC bot</username>
        <id>27823944</id>
      </contributor>
      <minor />
      <comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve">
      ...
      </text>
    </revision>
  </page>
</mediawiki>
Shamoon asked Jul 04 '19


3 Answers

The best approach is to use the MWXML Python package, which is part of the MediaWiki Utilities (installable with pip3 install mwxml). MWXML is designed to solve this specific problem and is widely used. The software was created by research staff at the Wikimedia Foundation and is maintained by researchers inside and outside the foundation.

Here's a code example adapted from an example notebook distributed with the library that prints out page IDs, revision IDs, timestamp, and the length of the text:

import mwxml
import glob

paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')

def process_dump(dump, path):
    for page in dump:
        for revision in page:
            yield page.id, revision.id, revision.timestamp, len(revision.text)

for page_id, rev_id, rev_timestamp, rev_textlength in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [page_id, rev_id, rev_timestamp, rev_textlength]))

The full example from which this is adapted reports the number of added and removed image links within each revision. It is fully documented but includes only 25 lines of code.

mako answered Nov 15 '22


The text attribute refers to the text between an element's tags (i.e. <tag>text</tag>), not to its child elements. Thus, in the case of the title element one has:

<title>AccessibleComputing</title>

and the text between the tags is AccessibleComputing.

In the case of the page element, the only text defined is '\n ' and there are other child elements (see below), including the title element:

<page>
    <title>Anarchism</title>
    <ns>0</ns>
    <id>12</id>
    ... 
</page>

See the w3schools page on XML elements for more details.
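This is easy to verify with a tiny standalone snippet (the sample XML here is a made-up fragment, not taken from the dump):

```python
import xml.etree.ElementTree as ET

# .text only covers the characters between an element's opening tag
# and its first child -- here, just the newline and indentation.
sample = """<page>
    <title>Anarchism</title>
    <ns>0</ns>
</page>"""

page = ET.fromstring(sample)
print(repr(page.text))          # '\n    ' -- whitespace only
print(page.find("title").text)  # Anarchism
```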

If you want to parse the file, I would recommend using either the findall method:

from lxml import etree
from lxml.etree import tostring

tree = etree.parse('data/enwiki-20190620-pages-articles-multistream.xml')
root = tree.getroot()
# iterate through all the titles
for title in root.findall(".//title", namespaces=root.nsmap):
    print(tostring(title))
    print(title.text)

which generates this output:

b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">AccessibleComputing</title>\n    '
AccessibleComputing
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Anarchism</title>\n    '
Anarchism

or remap the default namespace to an explicit prefix and iterate over the pages:

nsmap = root.nsmap
nsmap['x'] = root.nsmap[None]
nsmap.pop(None)
# iterate through all the pages
for page in root.findall(".//x:page", namespaces=nsmap):
    print(page)
    print(repr(page.text)) # which prints '\n    '
    print('number of children: %i' % len(page))

and the output is:

<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc610c8>
'\n    '
number of children: 5
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc71bc8>
'\n    '
number of children: 5

Please see the lxml tutorial for more details.
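As a self-contained illustration (using a made-up two-element sample rather than the real dump), the remapped prefix also lets you reach the nested text element that the question is actually after:

```python
from lxml import etree

# Hypothetical miniature document in the MediaWiki export namespace.
xml = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>AccessibleComputing</title>
    <revision>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
</mediawiki>"""

root = etree.fromstring(xml)
nsmap = {'x': root.nsmap[None]}  # remap the default namespace to a prefix
for page in root.findall(".//x:page", namespaces=nsmap):
    title = page.find(".//x:title", namespaces=nsmap).text
    text = page.find(".//x:revision/x:text", namespaces=nsmap).text
    print(title, text)
```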

user7440787 answered Nov 15 '22


You are trying to get the content of the text property of the <page> element, but that is just whitespace.

To get the text of the <text> element, just change

elif tname == 'page':

to

elif tname == 'text':
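Putting the fix together, here is a minimal runnable sketch of the corrected loop. An inline sample stands in for the real dump file (pass the file path to iterparse instead for the actual dump), and elem.clear() is added so memory stays bounded on a huge file:

```python
import io
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    # Drop the '{namespace}' prefix that iterparse includes in every tag.
    idx = t.rfind("}")
    return t[idx + 1:] if idx != -1 else t

# Made-up sample; for the real dump, pass the path to iterparse instead.
sample = io.BytesIO(b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>AccessibleComputing</title>
    <revision><text>#REDIRECT [[Computer accessibility]]</text></revision>
  </page>
</mediawiki>""")

title = None
for event, elem in ET.iterparse(sample, events=("end",)):
    tname = strip_tag_name(elem.tag)
    if tname == 'title':
        title = elem.text
    elif tname == 'text':
        print(title, elem.text)
    elif tname == 'page':
        elem.clear()  # free the finished page so memory stays bounded
```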
mzjn answered Nov 15 '22