The problem I am having is that I have a very large pickle file (2.6 GB) that I am trying to open, but each time I do so I get a memory error. I realize now that I should have used a database to store all the information, but it's too late now. The pickle file contains dates and text from the U.S. Congressional Record that was crawled from the internet (the crawl took about 2 weeks to run).
Is there any way I can access the information that I dumped into the pickle file incrementally, or a way to convert the pickle file into a SQL database or something else that I can open without having to re-input all the data? I really don't want to have to spend another 2 weeks re-crawling the Congressional Record and inputting the data into a database.
Thanks a bunch for your help.
EDIT: code for how the object gets pickled:
def save_objects(objects):
    with open('objects.pkl', 'wb') as output:
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    Links()
    file = open('datafile.txt', 'w')
    objects = []
    with open('links2.txt', 'rb') as infile:
        for link in infile:
            print(link)
            title, text, date = Get_full_text(link)
            article = Document(title, date, text)
            if text is not None:
                write_to_text(date, text)
                objects.append(article)
                save_objects(objects)
This is the program with the error:
def Main():
    file = open('objects1.pkl', 'rb')
    object = pickle.load(file)
Looks like you're in a bit of a pickle! ;-). Hopefully after this, you'll NEVER USE PICKLE EVER. It's just not a very good data storage format.
Anyways, for this answer I'm assuming your Document class looks a bit like this. If not, comment with your actual Document class:
class Document(object): # <-- the object part is very important! If it's not there, the format is different!
    def __init__(self, title, date, text): # assuming all strings
        self.title = title
        self.date = date
        self.text = text
Anyways, I made some simple test data with this class:
d = [Document(title='foo', text='foo is good', date='1/1/1'), Document(title='bar', text='bar is better', date='2/2/2'), Document(title='baz', text='no one likes baz :(', date='3/3/3')]
Pickled it with format 2 (pickle.HIGHEST_PROTOCOL for Python 2.x):
>>> s = pickle.dumps(d, 2)
>>> s
'\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06U\rbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'
And disassembled it with pickletools:
>>> pickletools.dis(s)
0: \x80 PROTO 2
2: ] EMPTY_LIST
3: q BINPUT 0
5: ( MARK
6: c GLOBAL '__main__ Document'
25: q BINPUT 1
27: ) EMPTY_TUPLE
28: \x81 NEWOBJ
29: q BINPUT 2
31: } EMPTY_DICT
32: q BINPUT 3
34: ( MARK
35: U SHORT_BINSTRING 'date'
41: q BINPUT 4
43: U SHORT_BINSTRING '1/1/1'
50: q BINPUT 5
52: U SHORT_BINSTRING 'text'
58: q BINPUT 6
60: U SHORT_BINSTRING 'foo is good'
73: q BINPUT 7
75: U SHORT_BINSTRING 'title'
82: q BINPUT 8
84: U SHORT_BINSTRING 'foo'
89: q BINPUT 9
91: u SETITEMS (MARK at 34)
92: b BUILD
93: h BINGET 1
95: ) EMPTY_TUPLE
96: \x81 NEWOBJ
97: q BINPUT 10
99: } EMPTY_DICT
100: q BINPUT 11
102: ( MARK
103: h BINGET 4
105: U SHORT_BINSTRING '2/2/2'
112: q BINPUT 12
114: h BINGET 6
116: U SHORT_BINSTRING 'bar is better'
131: q BINPUT 13
133: h BINGET 8
135: U SHORT_BINSTRING 'bar'
140: q BINPUT 14
142: u SETITEMS (MARK at 102)
143: b BUILD
144: h BINGET 1
146: ) EMPTY_TUPLE
147: \x81 NEWOBJ
148: q BINPUT 15
150: } EMPTY_DICT
151: q BINPUT 16
153: ( MARK
154: h BINGET 4
156: U SHORT_BINSTRING '3/3/3'
163: q BINPUT 17
165: h BINGET 6
167: U SHORT_BINSTRING 'no one likes baz :('
188: q BINPUT 18
190: h BINGET 8
192: U SHORT_BINSTRING 'baz'
197: q BINPUT 19
199: u SETITEMS (MARK at 153)
200: b BUILD
201: e APPENDS (MARK at 5)
202: . STOP
Looks complex! But really, it's not so bad. pickle is basically a stack machine; each ALL_CAPS identifier you see is an opcode, which manipulates the internal "stack" in some way for decoding. If we were trying to parse some complex structure, this would be more important, but luckily we're just making a simple list of essentially-tuples. All this "code" is doing is constructing a bunch of objects on the stack, and then pushing the entire stack into a list.
The one thing we DO need to care about is the BINPUT / BINGET opcodes you see scattered around. Basically, these are for 'memoization', to reduce the data footprint: pickle saves strings with BINPUT <id>, and then if they come up again, instead of re-dumping them, simply puts a BINGET <id> to retrieve them from the cache.
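You can watch the memoization happen on a toy example of my own (not from your data; exact offsets may vary). Pickling a list that contains the same string object twice stores it once with BINPUT and fetches it back with BINGET:
>>> import pickle, pickletools
>>> s = 'spam'
>>> pickletools.dis(pickle.dumps([s, s], 2))
0: \x80 PROTO 2
2: ] EMPTY_LIST
3: q BINPUT 0
5: ( MARK
6: U SHORT_BINSTRING 'spam'
12: q BINPUT 1
14: h BINGET 1
16: e APPENDS (MARK at 5)
17: . STOP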
Also, another complication! There's more than just SHORT_BINSTRING - there's the normal BINSTRING for strings of 256 bytes or more, and also some fun unicode variants as well. I'll just assume that you're using Python 2 with all ASCII strings. Again, comment if this isn't a correct assumption.
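If you want to check which opcode a given string gets, here's a quick sanity check of my own (not part of the parser below) - after the two-byte PROTO header, the third byte of the pickle is the string opcode:
>>> import pickle
>>> pickle.dumps('x' * 10, 2)[2] == pickle.SHORT_BINSTRING
True
>>> pickle.dumps('x' * 300, 2)[2] == pickle.BINSTRING
True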
OK, so we need to stream the file until we hit a '\x81' byte (NEWOBJ). Then, we need to scan forward until we hit a '(' character (MARK). Then, until we hit a 'u' (SETITEMS), we read pairs of key/value strings - there should be 3 pairs total, one for each field.
So, let's do this. Here's my script to read pickle data in streaming fashion. It's far from perfect, since I just hacked it together for this answer, and you'll need to modify it a lot, but it's a good start.
pickledata = '\x80\x02]q\x00(c__main__\nDocument\nq\x01)\x81q\x02}q\x03(U\x04dateq\x04U\x051/1/1q\x05U\x04textq\x06U\x0bfoo is goodq\x07U\x05titleq\x08U\x03fooq\tubh\x01)\x81q\n}q\x0b(h\x04U\x052/2/2q\x0ch\x06T\x14\x05\x00\x00bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterq\rh\x08U\x03barq\x0eubh\x01)\x81q\x0f}q\x10(h\x04U\x053/3/3q\x11h\x06U\x13no one likes baz :(q\x12h\x08U\x03bazq\x13ube.'
# simulate a file here
import StringIO
picklefile = StringIO.StringIO(pickledata)
import pickle # just for opcode names
import struct # binary unpacking
def try_memo(f, v, cache):
    opcode = f.read(1)
    if opcode == pickle.BINPUT:
        cache[f.read(1)] = v
    elif opcode == pickle.LONG_BINPUT:
        print 'skipping LONG_BINPUT to save memory, LONG_BINGET will probably not be used'
        f.read(4)
    else:
        f.seek(f.tell() - 1) # not a memo opcode, rewind

def try_read_string(f, opcode, cache):
    if opcode in [pickle.SHORT_BINSTRING, pickle.BINSTRING]:
        # SHORT_BINSTRING has a 1-byte unsigned length; BINSTRING a 4-byte little-endian length
        length_type = 'B' if opcode == pickle.SHORT_BINSTRING else '<i'
        str_length = struct.unpack(length_type, f.read(struct.calcsize(length_type)))[0]
        value = f.read(str_length)
        try_memo(f, value, cache)
        return value
    elif opcode == pickle.BINGET:
        return cache[f.read(1)]
    elif opcode == pickle.LONG_BINGET:
        raise Exception('Unexpected LONG_BINGET? Key ' + f.read(4))
    else:
        raise Exception('Invalid opcode ' + opcode + ' at pos ' + str(f.tell()))

memo_cache = {}
while True:
    c = picklefile.read(1)
    if c == pickle.NEWOBJ:
        while picklefile.read(1) != pickle.MARK:
            pass # scan forward to field instantiation
        fields = {}
        while True:
            opcode = picklefile.read(1)
            if opcode == pickle.SETITEMS:
                break
            key = try_read_string(picklefile, opcode, memo_cache)
            value = try_read_string(picklefile, picklefile.read(1), memo_cache)
            fields[key] = value
        print 'Document', fields
        # insert into sqlite here
    elif c == pickle.STOP:
        break
This correctly reads my test data in pickle format 2 (modified to have a long string):
$ python picklereader.py
Document {'date': '1/1/1', 'text': 'foo is good', 'title': 'foo'}
Document {'date': '2/2/2', 'text': 'bar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is betterbar is better', 'title': 'bar'}
Document {'date': '3/3/3', 'text': 'no one likes baz :(', 'title': 'baz'}
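If your end goal is the SQL conversion you asked about, the "# insert into sqlite here" comment in the loop above is where it goes. Here's a minimal sketch using the stdlib sqlite3 module - the congress.db filename and the documents table are my own choices, not anything your code requires:
import sqlite3

conn = sqlite3.connect('congress.db')
conn.execute('CREATE TABLE IF NOT EXISTS documents (title TEXT, date TEXT, text TEXT)')

# then, inside the streaming loop, in place of the print:
conn.execute('INSERT INTO documents VALUES (?, ?, ?)',
             (fields.get('title'), fields.get('date'), fields.get('text')))

# and once at the end:
conn.commit()
conn.close()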
Good luck!
You didn't pickle your data incrementally. You pickled your data monolithically and repeatedly. Each time around the loop, you destroyed whatever output data you had (open(..., 'wb') destroys the output file), and re-wrote all of the data again. Additionally, if your program ever stopped and then restarted with new input data, the old output data was lost.
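You can see the truncation with a tiny experiment (hypothetical filename):
>>> with open('demo.txt', 'wb') as f:
...     f.write('some data')
...
>>> open('demo.txt', 'wb').close() # re-opening with 'wb' wipes the file
>>> open('demo.txt', 'rb').read()
''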
I do not know why objects didn't cause an out-of-memory error while you were pickling, because it grew to the same size as the object that pickle.load() wants to create.
Here is how you could have created the pickle file incrementally:
def save_objects(objects):
    with open('objects.pkl', 'ab') as output: # Note: `ab` appends the data
        pickle.dump(objects, output, pickle.HIGHEST_PROTOCOL)

def Main():
    ...
    #objects=[]  <-- lose the objects list
    with open('links2.txt', 'rb') as infile:
        for link in infile:
            ...
            save_objects(article)
Then you could have incrementally read the pickle file like so:
import pickle

with open('objects.pkl', 'rb') as pickle_file:
    try:
        while True:
            article = pickle.load(pickle_file)
            print(article)
    except EOFError:
        pass
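If you find yourself doing this a lot, it's natural to wrap that loop in a generator - this is my own addition, not something the fix above requires:
import pickle

def load_all(path):
    # Yield each object that was dump()'d to `path`, one at a time,
    # so memory use stays bounded by the largest single record.
    with open(path, 'rb') as pickle_file:
        try:
            while True:
                yield pickle.load(pickle_file)
        except EOFError:
            pass

for article in load_all('objects.pkl'):
    print(article)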