I have:
import xml.etree.ElementTree as ET

def strip_tag_name(t):
    t = elem.tag
    idx = k = t.rfind("}")
    if idx != -1:
        t = t[idx + 1:]
    return t

events = ("start", "end")
title = None
for event, elem in ET.iterparse('data/enwiki-20190620-pages-articles-multistream.xml', events=events):
    tname = strip_tag_name(elem.tag)
    if event == 'end':
        if tname == 'title':
            title = elem.text
        elif tname == 'page':
            print(title, elem.text)
This seems to give the title just fine, but the page text always seems blank. What am I missing?
I haven't been able to open the file (it's huge), but I think this is an accurate snippet:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<dbname>enwiki</dbname>
<base>https://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.29.0-wmf.12</generator>
<case>first-letter</case>
<namespaces>
...
</namespaces>
</siteinfo>
<page>
<title>AccessibleComputing</title>
<ns>0</ns>
<id>10</id>
<redirect title="Computer accessibility" />
<revision>
<id>631144794</id>
<parentid>381202555</parentid>
<timestamp>2014-10-26T04:50:23Z</timestamp>
<contributor>
<username>Paine Ellsworth</username>
<id>9092818</id>
</contributor>
<comment>add [[WP:RCAT|rcat]]s</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">#REDIRECT [[Computer accessibility]]
{{Redr|move|from CamelCase|up}}</text>
<sha1>4ro7vvppa5kmm0o1egfjztzcwd0vabw</sha1>
</revision>
</page>
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
<revision>
<id>766348469</id>
<parentid>766047928</parentid>
<timestamp>2017-02-19T18:08:07Z</timestamp>
<contributor>
<username>GreenC bot</username>
<id>27823944</id>
</contributor>
<minor />
<comment>Reformat 1 archive link. [[User:Green Cardamom/WaybackMedic_2.1|Wayback Medic 2.1]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve">
...
</text>
</revision>
</page>
</mediawiki>
The best approach is to use the mwxml Python package, which is part of the MediaWiki Utilities (installable with pip3 install mwxml). mwxml is designed to solve this specific problem and is widely used. The software was created by research staff at the Wikimedia Foundation and is maintained by a set of researchers inside and outside of the Foundation.
Here's a code example adapted from an example notebook distributed with the library that prints out page IDs, revision IDs, timestamp, and the length of the text:
import mwxml
import glob

paths = glob.glob('/public/dumps/public/nlwiki/20151202/nlwiki-20151202-pages-meta-history*.xml*.bz2')

def process_dump(dump, path):
    # Yield one tuple per revision of every page in the dump.
    for page in dump:
        for revision in page:
            yield page.id, revision.id, revision.timestamp, len(revision.text)

for page_id, rev_id, rev_timestamp, rev_textlength in mwxml.map(process_dump, paths):
    print("\t".join(str(v) for v in [page_id, rev_id, rev_timestamp, rev_textlength]))
The full example from which this is adapted reports the number of added and removed image links within each revision. It is fully documented but includes only 25 lines of code.
The text attribute refers to the text between the element's tags (i.e. <tag>text</tag>) and not to all the child elements. Thus, in the case of the title element one has:
<title>AccessibleComputing</title>
and the text between the tags is AccessibleComputing.
In the case of the page element, the only text defined is '\n ' and there are other child elements (see below), including the title element:
<page>
<title>Anarchism</title>
<ns>0</ns>
<id>12</id>
...
</page>
See the w3schools page for more details.
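The difference between an element's own text and its children can be checked on a tiny document. This is only a sketch: it uses the standard library's xml.etree.ElementTree rather than lxml, and SAMPLE is a made-up miniature of one page record with the namespace left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of one <page> record (namespace omitted for brevity).
SAMPLE = """<page>
  <title>Anarchism</title>
  <ns>0</ns>
  <revision>
    <text>Article body</text>
  </revision>
</page>"""

page = ET.fromstring(SAMPLE)

# .text holds only the character data before the first child element,
# which here is a newline followed by indentation.
page_text = page.text

# Child elements are reached by iterating over the element or with find().
child_tags = [child.tag for child in page]
title_text = page.find("title").text
body_text = page.find("revision/text").text

print(repr(page_text))
print(child_tags)
print(title_text, "|", body_text)
```

So the article body has to be read from the text element nested inside revision, not from page itself.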
If you want to parse the file, I would recommend using either the findall method:
from lxml import etree
from lxml.etree import tostring

tree = etree.parse('data/enwiki-20190620-pages-articles-multistream.xml')
root = tree.getroot()
# iterate through all the titles
for title in root.findall(".//title", namespaces=root.nsmap):
    print(tostring(title))
    print(title.text)
which generates this output:
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">AccessibleComputing</title>\n '
AccessibleComputing
b'<title xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">Anarchism</title>\n '
Anarchism
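or, with the default namespace mapped to an explicit prefix (the form that the xpath method also needs, since it does not accept a None prefix):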
# copy the namespace map and give the default namespace the prefix 'x'
nsmap = dict(root.nsmap)
nsmap['x'] = nsmap.pop(None)
# iterate through all the pages
for page in root.findall(".//x:page", namespaces=nsmap):
    print(page)
    print(repr(page.text))  # which prints '\n '
    print('number of children: %i' % len(page))  # len() replaces the deprecated getchildren()
and the output is:
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc610c8>
'\n '
number of children: 5
<Element {http://www.mediawiki.org/xml/export-0.10/}page at 0x7ff75cc71bc8>
'\n '
number of children: 5
Please see the lxml tutorial for more details.
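The same prefix idea can be used to pull out each page's title together with its revision text. The following is a rough sketch using the standard library's xml.etree.ElementTree in place of lxml; NS, SAMPLE and titles_and_texts are names made up for this illustration, and the sample is a cut-down version of the snippet in the question.

```python
import xml.etree.ElementTree as ET

# Map an arbitrary prefix to the dump's default namespace.
NS = {"x": "http://www.mediawiki.org/xml/export-0.10/"}

# Hypothetical cut-down dump with two pages.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>AccessibleComputing</title>
    <revision>
      <text>#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <revision>
      <text>Article body</text>
    </revision>
  </page>
</mediawiki>"""


def titles_and_texts(root):
    """Return a list of (title, revision text) pairs, one per <page>."""
    results = []
    for page in root.findall("x:page", NS):
        title = page.findtext("x:title", default="", namespaces=NS)
        text = page.findtext("x:revision/x:text", default="", namespaces=NS)
        results.append((title, text))
    return results


root = ET.fromstring(SAMPLE)
pairs = titles_and_texts(root)
for title, text in pairs:
    print(title, "->", text)
```

Note that this loads the whole tree into memory, as etree.parse does, so for the full dump a streaming parser is likely the better fit.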
You are trying to get the content of the text property of the <page> element, but that is just whitespace.
To get the text of the <text> element, just change
    elif tname == 'page':
to
    elif tname == 'text':
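Put together, a streaming version might look like the sketch below. It assumes the structure shown in the question's snippet (one title and one text element per page); local_name, iter_pages and SAMPLE are hypothetical names for this illustration, and an in-memory sample stands in for the real dump file. Clearing each finished page is an addition that is not in the original code, included because the file is huge.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop a leading '{namespace}' from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, clearing pages once they are done."""
    title = text = None
    # Only 'end' events are needed: an element's text is complete by then.
    for event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text
        elif name == "page":
            yield title, text
            title = text = None
            # Drop the finished page's children so memory use stays low.
            elem.clear()


# Hypothetical in-memory stand-in for the dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>AccessibleComputing</title>
    <revision>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]]</text>
    </revision>
  </page>
  <page>
    <title>Anarchism</title>
    <revision>
      <text xml:space="preserve">Article body</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, "->", text)
```

For the real dump, the file path from the question can be passed to iter_pages in place of the BytesIO object.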