Shelve dictionary size is >100Gb for a 2Gb text file

I am creating a shelve file of sequences from a genomic FASTA file:

# Import necessary libraries
import shelve
from Bio import SeqIO

# Create dictionary of genomic sequences
genome = {}
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        genome[str(record.id)] = str(record.seq)

# Shelve genome sequences
myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
myShelve.update(genome)
myShelve.close()

The FASTA file itself is 2.6 GB, but when I shelve it, a file of more than 100 GB is produced, and my computer throws a number of complaints about running out of memory and the startup disk being full. This only seems to happen when I run the script under OS X Yosemite; on Ubuntu it works as expected. Any suggestions why this is not working? I'm using Python 3.4.2.

asked by jma1991

2 Answers

Verify which dbm interface is being used with import dbm; print(dbm.whichdb('your_file.db')). The file format used by shelve depends on the best dbm implementation installed on your system. The newest is gdbm; dumbdbm is the pure-Python fallback used when no binary implementation is found, and ndbm is something in between.
https://docs.python.org/3/library/shelve.html
https://docs.python.org/3/library/dbm.html
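
For example, a minimal check (using the .db path from the question; some backends append their own extension, so the name on disk may differ):

import dbm

# Prints e.g. 'dbm.gnu', 'dbm.ndbm' or 'dbm.dumb'; None means the file
# could not be found or read, an empty string means the format was not
# recognized.
print(dbm.whichdb("Mus_musculus.GRCm38.dna.primary_assembly.db"))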

It is not favourable to hold all of the data in memory, because that leaves no memory for the filesystem cache. Updating in smaller blocks is better; I don't even see a slowdown when items are updated one by one:

import shelve
from Bio import SeqIO

# Write each record to the shelf as it is parsed, instead of building
# the whole dictionary in memory first.
myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        myShelve.update([(str(record.id), str(record.seq))])
myShelve.close()

It is known that dbm databases become fragmented if the application crashes after updates without properly closing the database. I think this was your case. You probably have no important data in the big file yet, but in the future you can defragment such a database with gdbm's reorganize().
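
A minimal sketch of that defragmentation, assuming the file turns out to be GNU dbm (the dbm.gnu module is only available where the gdbm library is installed):

import dbm.gnu

# Open the existing database read-write and rewrite it in place,
# reclaiming the space wasted by fragmentation.
db = dbm.gnu.open("Mus_musculus.GRCm38.dna.primary_assembly.db", "w")
db.reorganize()
db.close()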

answered by hynekcer

I had the very same problem: on a macOS system, a shelve with about 4 megabytes of data grew to an enormous 29 gigabytes on disk! This apparently happened because I updated the same key/value pairs in the shelve over and over again.

As my shelve was based on GNU dbm, I was able to use his hint about reorganizing. Here is the code that brought my shelve file back to normal size within seconds:

import dbm

# shelfFileName is the path to the existing shelve database file
db = dbm.open(shelfFileName, 'w')  # open read-write
db.reorganize()                    # rewrite the file, reclaiming wasted space
db.close()

I am not sure whether this technique will work for other (non-GNU) dbms as well. To test your dbm system, remember the code shown by @hynekcer:

import dbm
print(dbm.whichdb(shelfFileName))

If GNU dbm is used by your system, this should output 'dbm.gnu' (which is the new name for the older gdbm).

answered by Jpsy