I have a binary file written with pickle.dump containing logs from an app (a set of tuples of floats and strings). It worked great for 24 hours, but now when I try to read it using
import pickle
path = "/path/to/file.pkl"
with open(path, 'rb') as f:
    score_board = pickle.load(f)
I get UnpicklingError: invalid load key, '\x00'.
I think there is a null value somewhere corrupting the file. I know the error is in the last inputs, as the file stopped being updated when the error occurred. Each tuple in the set contains (a score [float], a short sentence [str], a username [str], a datetime [str]). I was wondering if there is a way to read the file only up to a certain point, or even to edit it manually, to make it safe to read.
Thanks in advance.
Add the right opcodes to the end of your pickle file. Mine should end in 'bsbuu.' but instead ends with 'bsbj' (the opcodes between items). Editing the binary file to put 'bsbuu.' at the end instead of 'bsbj' fixed it. Now my file opens perfectly. Eat it, pickle devs.
Your opcodes are probably different since it depends on the data structures in your file and where it got cut off. Make a good pickle file with the same data structure, look at the opcodes, and adjust your corrupted file as needed.
You can use this to see the opcodes in your pickle file:
python -m pickletools file
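If you'd rather do that from Python, here is a minimal sketch built on pickletools.genops, which walks the opcodes without loading any objects. The helper name scan_opcodes is made up for illustration, and the sample dict and protocol 3 are assumptions, not the layout of your file. It keeps whatever opcodes it could read before the stream broke, so it also works on a truncated file.

```python
import pickle
import pickletools


def scan_opcodes(raw):
    """Return (opcode names read so far, error message or None).

    Hypothetical helper: walks the pickle stream with pickletools.genops
    and stops quietly at the first point where the data is unreadable.
    """
    names = []
    error = None
    try:
        for opcode, _arg, _pos in pickletools.genops(raw):
            names.append(opcode.name)
    except ValueError as exc:  # genops raises ValueError on bad or missing data
        error = str(exc)
    return names, error


# A small known-good pickle with the same shape as the real data (assumed shape).
sample = {
    "key1": {"value1": b"a", "value2": b"b"},
    "key2": {"value1": b"c", "value2": b"d"},
}
good = pickle.dumps(sample, protocol=3)
good_names, good_error = scan_opcodes(good)
print(good_names[-3:], good_error)

# Simulate an interrupted write by chopping off the last two bytes.
bad_names, bad_error = scan_opcodes(good[:-2])
print(bad_names[-3:], bad_error)
```

Comparing the tail of the good file with the tail of the broken one shows which closing opcodes are missing.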
You can use this to turn a binary pickle file into plaintext hex, edit it, then turn it back to binary:
xxd picklefile hexfile
(edit file)
xxd -r hexfile newpicklefile
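If xxd isn't available, the same edit can be done in a few lines of Python. This is only a sketch: replace_tail is a hypothetical helper, the file names are placeholders, and the byte strings b"bsbj" and b"bsbuu." are the ones from my file, so yours will likely differ. It writes to a new file and leaves the corrupted original untouched.

```python
from pathlib import Path


def replace_tail(src, dst, bad_tail, good_tail):
    """Copy src to dst, swapping the trailing bytes bad_tail for good_tail.

    Hypothetical helper: refuses to write anything if the file does not
    actually end with bad_tail, so a wrong guess cannot mangle the copy.
    """
    raw = Path(src).read_bytes()
    if not raw.endswith(bad_tail):
        raise ValueError("file does not end with the expected bytes")
    Path(dst).write_bytes(raw[: len(raw) - len(bad_tail)] + good_tail)


# Example call (placeholder file names, tails taken from my own file):
# replace_tail("picklefile", "newpicklefile", b"bsbj", b"bsbuu.")
```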
First off, let's get this out of the way: pickle sucks. It's an awful format. Binary crap with zero ability to recover any data. Even so-called python experts say "bad pickle file? kiss your data goodbye and start over." It's a complete joke of a format.
Python docs aren't any better. Zero tools for recovering partial data. Zero ability to deal with errors. Best they can do is print opcodes and tell you "write your own unpickler". The attitude is so user unfriendly it even makes IBM blush.
My data is pretty straightforward. It's a big old dict of dicts. Like this:
data = {
    "key1" : {
        "value1" : binarystring ,
        "value2" : binarystring ,
        "value3" : binarystring ,
    } ,
    "key2" : {
        "value1" : binarystring ,
        "value2" : binarystring ,
        "value3" : binarystring ,
    } ,
    ...
}
My write got interrupted near the end. Most of my data is still there in the corrupted pickle file. But the stupid thing won't open. pickle.load reads 80 MB of data and says "Oops end of file, sorry no data for you". Which is garbage. pickle could at least return the data it read successfully, if you pass an error flag or something. Nope, pickle refuses. NO PICKLE FOR YOU!!
I want to recover the data that's there. I'm not gonna write my own unpickler, figuring out all those opcodes. That's a stupid solution and anyone who proposes it should be ashamed of themselves. There's gotta be a better way.
I made a better way. Pickle files have a bunch of opcodes between the data to define the structure. You can use this command to display the data and opcodes in your file:
python -m pickletools file
I noticed that in a sample pickle file containing the data structure above, each sub-dict (i.e. the dict under "key1") ends with 'bsbu' before the next item starts (i.e. "key2"). Then the file ends with 'bsbuu.' The trailing 'u.' is two opcodes: 'u' (SETITEMS) stores the pending items into the top-level dict, and '.' (STOP) ends the data.
My corrupt file ends with 'bsbj' instead of 'bsbuu.' So if I change the last part of the file to 'bsbuu.' instead, then it should close both dicts and end the data. Stands to reason, right?
I don't have a good binary editor handy, so I used this to convert the binary pickle file to a plaintext hex file:
xxd picklefile hexfile
Changed the opcodes at the end, then converted back to binary with:
xxd -r hexfile newpicklefile
Called pickle.load on the new file and lo and behold, Bob's your uncle, it works! I recovered 80 MB of data. Thanks for nothing, python! >:(
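For a plain dict of dicts, the same idea can be automated, roughly. The sketch below is not what I ran on my file. It assumes a pickle written without framing (protocol 3 here), sub-dicts with at least two entries each, and values that are plain byte strings. The function name salvage_dict_of_dicts is made up. It uses pickletools.genops to find the last spot where a sub-dict was closed, cuts the data there, and appends 'u.' to close the top-level dict and stop.

```python
import io
import pickle
import pickletools


def salvage_dict_of_dicts(raw):
    """Load every complete sub-dict from a truncated dict-of-dicts pickle.

    Hypothetical helper, valid only under the assumptions stated above.
    """
    stream = io.BytesIO(raw)
    open_marks = 0  # MARK opcodes not yet closed by a SETITEMS
    cut = None      # offset just past the last finished sub-dict
    try:
        for opcode, _arg, _pos in pickletools.genops(stream):
            if opcode.name == "MARK":
                open_marks += 1
            elif opcode.name == "SETITEMS":
                open_marks -= 1
                if open_marks == 1:
                    # A sub-dict just closed; only the top-level MARK is open.
                    cut = stream.tell()
    except ValueError:
        pass  # reached the damaged or missing part of the file
    if cut is None:
        raise ValueError("no complete sub-dict found")
    # b"u" (SETITEMS) fills the top-level dict, b"." (STOP) ends the pickle.
    return pickle.loads(raw[:cut] + b"u.")


# Demo on made-up data: cut a good pickle short, then salvage it.
demo = {
    f"key{i}": {"value1": bytes([i]) * 4, "value2": bytes([i]) * 5, "value3": bytes([i]) * 6}
    for i in range(5)
}
truncated = pickle.dumps(demo, protocol=3)[:-10]
recovered = salvage_dict_of_dicts(truncated)
print(sorted(recovered))
```

If your values contain lists, tuples or other containers, the MARK counting above is too naive, and you are back to comparing opcodes by hand.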
Yes I know I didn't get all my data back. But I got a lot of it, no thanks to python. Telling users to eat it and start over is not helpful in any way.
In future I'll consider moving to json for storing my data. It's a pain because the binary strings don't serialize to json out of the box. I'll need to make a converter for both json.dump and json.load, probably turn them into hex strings. It's a bit of work. But it's far far FAR easier to recover a corrupted json file than a pickle file.
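Here's roughly what that converter could look like. This is a sketch, not something I've run on my real data. The tag name "__bytes_hex__" and the function names are made up. Wrapping each hex string in a tagged one-key dict is one way to tell binary values apart from ordinary strings when loading.

```python
import json

TAG = "__bytes_hex__"  # made-up marker key; any name unused by the data works


def encode_bytes(obj):
    """json 'default' hook: turn bytes into a tagged hex string."""
    if isinstance(obj, (bytes, bytearray)):
        return {TAG: bytes(obj).hex()}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


def decode_bytes(d):
    """json 'object_hook': turn tagged hex strings back into bytes."""
    if set(d) == {TAG}:
        return bytes.fromhex(d[TAG])
    return d


data = {
    "key1": {"value1": b"\x00\x01\x02", "value2": b"\xff"},
    "key2": {"value1": b"", "value2": b"abc"},
}
text = json.dumps(data, default=encode_bytes, indent=1)
restored = json.loads(text, object_hook=decode_bytes)
print(restored == data)
```

The file comes out as readable text, so a half-written file can be fixed in any text editor by deleting the broken last entry and closing the braces.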
Yeah, pickle does a lot more than json, storing arbitrary objects, executable code, yada yada yada. What good is that if pickle doesn't function properly? If one little gnat gets into the datafile and pickle has a full-on meltdown, running away screaming and refusing to go back in the house. Cmon man. Grow up, python.
In the real world, errors happen. People and code deal with them. A default freakout is fine. But not providing any other options is inexcusable.