Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Many dictionaries using massive amounts of RAM

I have a very simple Python script to create (for test purposes), 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs. eg.

{'Name': 'Jordan', 'Age': 35}

The script very simply take a query on name and age, searches through the list of dictionaries and returns a new list containing the index of all matching dictionary entries.

However as you can see below, an insane amount of memory is consumed. I presume I am making a very naive mistake somewhere.

screenshot of code and task manager show ram usage

My code is as follows: (can also be viewed in the image if more readable).

import sys

# Firstly, we will create 35 million records in memory, all will be the same apart from one

def search(key, value, data, age):
    print("Searching, please wait")
    # Create list to store returned PKs
    foundPKS = []
    for index in range(0, len(data)):
        if key in data[index] and 'Age' in data[index]:
            if data[index][key] == value and data[index]['Age'] >= age:
                foundPKS.append(index)
    results = foundPKS
    return results

def createdata():
    # Let's create our list for storing our dictionaries
    print("Creating database, please wait")
    dictList = []
    for index in range(0, 35000000):
        # Define dictionary
        record = {'Name': 'Jordan', 'Age': 25}
        if 24500123 <= index <= 24500200:
            record['Name'] = 'Chris'
            record['Age'] = 33
        # Add the dict to a list
        dictList.append(record)
    return dictList

datareturned = createdata()

keyname = input("For which key do you wish to search?")
valuename = input("Which values do you want to find?")
valueage = input("What is the minimum age?")

print("Full data set object size:" + str(sys.getsizeof(datareturned)))
results = search(keyname, valuename, datareturned, int(valueage))

if len(results) > 0:
    print(str(len(results)) + " found. Writing to results.txt")
    fo = open("results.txt", "w")
    for line in range(0, len(results)):
        fo.write(str(results[line]) + "\n")
    fo.close()

What is causing the massive consumption of RAM?

like image 665
Jordan Avatar asked Apr 24 '17 02:04

Jordan


1 Answers

The overhead for a dict object is quite large. It depends on your Python version and your system architechture, but on Python 3.5 64bit

In [21]: sys.getsizeof({})
Out[21]: 288

So guesstimating:

250*36e6*1e-9 == 9.0

So that is a lower limit on my ram usage in gigabytes if I created that many dictionaries, not factoring in the list!

Rather than use a dict as a record type, which isn't really the use case, use a namedtuple.

And to get a view of how this compares, let's set up an equivalent list of tuples:

In [23]: Record = namedtuple("Record", "name age")

In [24]: records = [Record("john", 28) for _ in range(36000000)]

In [25]: getsizeof = sys.getsizeof

Consider:

In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age)  for record in records)
Out[31]: 5220000000

In [32]: _ + getsizeof(records)
Out[32]: 5517842208

In [33]: _ * 1e-9
Out[33]: 5.517842208

So 5 gigs is an upper limit that is quite conservative. For example, it assumes that there is no small-int caching going on, which for a record-type of ages will totally matter. On my own system, the python process is registering 2.7 gigs of memory usage (via top).

So, what is actually going on in my machine is better modeled by being conservative for strings assuming -- unique strings that have an average size of 10, so no string interning -- but liberal for ints, assuming int-caching is taking care of our int objects for us, so we just have to worry about the 8-byte pointers!

In [35]: sum(getsizeof("0123456789") + 8  for record in records)
Out[35]: 2412000000

In [36]: _ + getsizeof(records)
Out[36]: 2709842208

In [37]: _ * 1e-9
Out[37]: 2.709842208

Which is a good model for what I'm observing from top.

If you really want efficient storage

Now, if you really want to cram data into ram, you are going to have to lose the flexibility of Python. You could use the array module in combination with struct, to get C-like memory efficiency. An easier world to wade into might be numpy instead, which allows for similar things. For example:

In [1]: import numpy as np

In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])

In [3]: records = np.empty((36000000), dtype=recordtype)

In [4]: records.nbytes
Out[4]: 756000000

In [5]: records.nbytes*1e-9
Out[5]: 0.756

Note, we are now allowed to be quite compact. I can use 8-bit unsigned integers (i.e. a single byte) to represent age. However, immediately I am faced with some inflexibility: if I want efficient storage of strings I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ascii characters might very well suffice for names.

Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be string of digits from a simple count, and age will be selected from a normal distribution with a mean of 50 and a standard deviation of 10.

In [8]: for i in range(1, 36000000+1):
   ...:     records['name'][i - 1] = b"%08d" % i
   ...:

In [9]: import random
   ...: for i in range(36000000):
   ...:     records['age'][i] = max(0, int(random.normalvariate(50, 10)))
   ...:

Now, we can use numpy to query our records. For example, if you want the indices of your records given some condition, use np.where:

In [10]: np.where(records['age'] > 70)
Out[10]: (array([      58,      146,      192, ..., 35999635, 35999768, 35999927]),)

In [11]: idx = np.where(records['age'] > 70)[0]

In [12]: len(idx)
Out[12]: 643403

So 643403 records that have an age > 70. Now, let's try 100:

In [13]: idx = np.where(records['age'] > 100)[0]

In [14]: len(idx)
Out[14]: 9

In [15]: idx
Out[15]:
array([ 2315458,  5088296,  5161049,  7079762, 15574072, 17995993,
       25665975, 26724665, 28322943])

In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
       (b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
       (b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
      dtype=[('name', 'S20'), ('age', 'u1')])

Of course, one major inflexibility is that numpy arrays are sized. Resizing operations are expensive. Now, you could maybe wrap a numpy.array in some class and it will act as an efficient backbone, but at that point, you might as well use a real data-base. Lucky for you, Python comes with sqlite.

like image 184
juanpa.arrivillaga Avatar answered Oct 25 '22 19:10

juanpa.arrivillaga