
Best way to handle a large list of dictionaries in Python

I am performing a statistical test that uses 10,000 permutations as a null distribution.

Each permutation is a 10,000-key dictionary. Each key is a gene; each value is the set of patients associated with that gene. The dictionary is generated programmatically and can be written to and read back from a file.
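For concreteness, one way such a permutation could be written out and read back, assuming one JSON object per line (the gene names, patient IDs, and file name here are only illustrative):

import json

# Minimal sketch (assumed encoding): each permutation is a dict of
# gene -> set of patient IDs, stored as one JSON object per line.
# Sets are not JSON-serializable, so they are converted to sorted lists.
permutation = {'BRCA1': {'patient1', 'patient7'}, 'TP53': {'patient2'}}

with open('permutations.jsonl', 'a') as f:
    f.write(json.dumps({g: sorted(p) for g, p in permutation.items()}) + '\n')

with open('permutations.jsonl') as f:
    for line in f:
        perm = {g: set(p) for g, p in json.loads(line).items()}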

I want to iterate over these permutations to perform my statistical test; however, keeping this entire list in memory is slowing my program down.

Is there a way to keep these dictionaries stored on disk and yield the permutations one at a time as I iterate over them?

Thank you!

asked Aug 29 '15 by Jonathan Lu

1 Answer

This is a general computing problem: you want the speed of in-memory data but don't have enough RAM to hold it all. You have at least these options:

  • Buy more RAM (obviously)
  • Let the process swap. This leaves it to the OS to decide which data to store on disk and which to store in memory
  • Don't load everything into memory at once

Since you are iterating over the dataset, one solution is to load the data lazily with a generator:

def get_data(filename):
    # Lazily yield one item (one serialized permutation) per line,
    # so only a single line is held in memory at a time.
    with open(filename) as f:
        while True:
            line = f.readline()
            if not line:  # an empty string means end of file
                break
            yield line

for item in get_data('my_genes.dat'):
    # deserialize() stands for whatever decoding matches your file
    # format, e.g. json.loads for one JSON object per line.
    gather_statistics(deserialize(item))

A variant is to split the data across multiple files, or store it in a database, so you can process it in batches of n items at a time, as in the sketch below.
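A minimal sketch of that batching approach, reusing the get_data and deserialize names from above (the batch size of 100 is an arbitrary choice):

from itertools import islice

def batches(iterable, n):
    # Yield successive lists of up to n items from any iterable.
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch

for batch in batches(get_data('my_genes.dat'), 100):
    for item in batch:
        gather_statistics(deserialize(item))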

answered Nov 15 '22 by Erik Cederstrand