
Error opening megawarc archive from Python

I need to use a Python script to access a web archive.

What I have is a 'megawarc' web archive file from http://archive.org/details/archiveteam-fanfiction-warc-11. I need to un-megawarc this, using the python script found at https://github.com/alard/megawarc.

I'm trying to run the 'restore' command, and I have the three files needed (FILE.warc.gz, FILE.tar, and FILE.json.gz) from the first link.

I have both Python 2.7 and 3.3 installed.

--------------update--------------

I've tried both this method:

python megawarc restore FILE

and this method:

  • Make sure you have the files megawarc and ordereddict.py in the same directory, with the files you want to convert.
  • Rename the file megawarc to megawarc.py.
  • Open a Python console in this directory.
  • Type the following code (line by line):

import sys
sys.argv = ['megawarc','restore','FILE']
import megawarc
megawarc.main()

Using Python 2.7, this is what I get:

c:\Python27>python megawarc restore FILE
Traceback (most recent call last):
  File "megawarc", line 563, in <module>
    main()
  File "megawarc", line 552, in main
    mwr.process()
  File "megawarc", line 460, in process
    self.process_entry(entry, tar_out)
  File "megawarc", line 478, in process_entry
    entry["target"]["offset"], entry["target"]["size"])
  File "megawarc", line 128, in copy_to_stream
    raise Exception("End of file: %d bytes expected, but %d bytes read." % (buf_size, l))
Exception: End of file: 4096 bytes expected, but 236 bytes read.

Is there something else I'm missing?

I have the following files, all in c:\python27:

FILE.megawarc.json.gz

FILE.megawarc.tar

FILE.megawarc.warc.gz

megawarc

ordereddict.py

Is this some kind of corrupt-file error? Is there something I'm missing?
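For context, the message in the traceback reports that a read asked for 4096 bytes and the input ended after 236. The sketch below is an illustration of how a fixed-size copy loop produces that kind of error when the source is shorter than the requested range, which could happen, for example, with an incomplete download. It is not megawarc's actual copy_to_stream; the function name copy_exact and its parameters are hypothetical.

```python
import io


def copy_exact(source, target, offset, size, buf_size=4096):
    """Copy exactly `size` bytes starting at `offset`; raise if the source ends early.

    Hypothetical helper for illustration only, not taken from megawarc.
    """
    source.seek(offset)
    remaining = size
    while remaining > 0:
        wanted = min(buf_size, remaining)
        chunk = source.read(wanted)
        if len(chunk) < wanted:
            # The source ran out before the requested range was fully copied.
            raise EOFError("End of file: %d bytes expected, but %d bytes read."
                           % (wanted, len(chunk)))
        target.write(chunk)
        remaining -= wanted


if __name__ == "__main__":
    short_source = io.BytesIO(b"x" * 236)  # stands in for a file that is too short
    try:
        copy_exact(short_source, io.BytesIO(), offset=0, size=4096)
    except EOFError as error:
        print(error)
```

If the real cause is similar, comparing the sizes of the downloaded files against the sizes listed on the download page would be a reasonable first check.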

Sarah Waters, asked Jun 12 '13 12:06


1 Answer

The repository at the second link you provided contains two important files:

megawarc
ordereddict.py

The executable script is megawarc. To run it, launch it in a shell with:

python megawarc restore FILE

Alternatively, if you're using a UNIX-based system, you can do

chmod +x megawarc

This makes the megawarc script executable. You can then run it with:

./megawarc restore FILE

Here, FILE is literally what you should type if your three files are named FILE.warc.gz, FILE.tar, and FILE.json.gz. Otherwise, replace this parameter with the common prefix of your three input files.
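The prefix rule above can be checked before running the script. The following is a minimal sketch, assuming the three suffixes named in this answer; the helper name missing_megawarc_files is hypothetical and not part of megawarc.

```python
import os

# Suffixes as named in this answer; adjust them if your files are named differently.
MEGAWARC_SUFFIXES = (".warc.gz", ".tar", ".json.gz")


def missing_megawarc_files(prefix, directory="."):
    """Return the expected file names for `prefix` that are absent from `directory`."""
    expected = [prefix + suffix for suffix in MEGAWARC_SUFFIXES]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


if __name__ == "__main__":
    missing = missing_megawarc_files("FILE")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All three input files found for prefix FILE")
```

An empty result means all three files exist under that prefix. A non-empty result lists the names that would need to be present, which is a quick way to spot a prefix that does not match your files.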

EDIT:

Okay, I found an alternative that works if you don't have a standard shell to start the script from the command line. Here is what you have to do:

  • Make sure you have the files megawarc and ordereddict.py in the same directory, with the files you want to convert.
  • Rename the file megawarc to megawarc.py.
  • Open a Python console in this directory.
  • Type the following code (line by line):

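    import sys
    # Simulate the command line "megawarc restore FILE"
    sys.argv = ['megawarc','restore','FILE']
    # This import only works once the script has been renamed to megawarc.py
    import megawarc
    megawarc.main()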
    

This should work; I've just tried it. Hope this helps.
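A variation on the console approach, sketched here with the standard library's runpy module, sets the arguments and runs the script by path, so the rename to megawarc.py is not needed. The helper name run_script_with_args is hypothetical, and the sketch assumes the script starts its work from an `if __name__ == "__main__":` guard.

```python
import runpy
import sys


def run_script_with_args(script_path, args):
    """Run a Python script as if it had been started as `python script_path args...`."""
    saved_argv = sys.argv
    sys.argv = [script_path] + list(args)
    try:
        # run_name="__main__" makes an `if __name__ == "__main__":` guard fire.
        return runpy.run_path(script_path, run_name="__main__")
    finally:
        # Restore the original arguments even if the script raises.
        sys.argv = saved_argv
```

From a Python console opened in the directory that holds the files, the call would be run_script_with_args("megawarc", ["restore", "FILE"]). Use the same Python version you would use on the command line (2.7 in the question).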

ibi0tux, answered Oct 21 '22 17:10