
Generating plain text from a Wikipedia database dump

I found a Python script (here: Wikipedia Extractor) that can generate plain text from an (English) Wikipedia database dump. When I run this command (as stated on the script's page):

$ python enwiki-latest-pages-articles.xml WikiExtractor.py -b 500K -o extracted

I get this error:

  File "enwiki-latest-pages-articles.xml", line 1
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
    ^
SyntaxError: invalid syntax

I'm executing the script using Python 2.7.6 & Cygwin on Windows 7.

I hope someone who has already used this script, or who has experience with Python, can help me solve this error.

Thanks in advance!

Asim asked Mar 31 '14 21:03


1 Answer

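The first argument to python should be the script name. In your command the XML dump comes first, so Python tries to run the dump itself as a Python script and fails with a SyntaxError on its first line.

You probably need to swap the .xml and .py file names: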

$ python WikiExtractor.py enwiki-latest-pages-articles.xml -b 500K -o extracted
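
If it helps to see why the original order fails, here is a minimal sketch. It shows that the first line of the dump is not valid Python source (which is what the interpreter was asked to run), and it builds the argument list in the working order. The helper names `is_python_source` and `build_command` are hypothetical and only for illustration; the `-b` and `-o` options are the ones from the command above, and the XML line is shortened.

```python
import sys


def is_python_source(text, filename="<input>"):
    """Return True if `text` compiles as Python source, False on SyntaxError."""
    try:
        compile(text, filename, "exec")
    except SyntaxError:
        return False
    return True


def build_command(script, dump, split_size="500K", out_dir="extracted"):
    """Build the argument list: interpreter, then the script, then its arguments."""
    return [sys.executable, script, dump, "-b", split_size, "-o", out_dir]


# Shortened first line of the dump (illustrative, not the full header).
xml_first_line = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="en">'

# The interpreter cannot parse XML as Python, hence the SyntaxError at line 1.
print(is_python_source(xml_first_line, "enwiki-latest-pages-articles.xml"))

# Script first, dump second, options last.
cmd = build_command("WikiExtractor.py", "enwiki-latest-pages-articles.xml")
print(cmd[1:])
```

The list returned by `build_command` could be passed to `subprocess.call` if you want to launch the extractor from another Python script instead of the shell.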
alecxe answered Sep 28 '22 09:09