I have a very simple Python script to create (for test purposes), 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs. eg.
{'Name': 'Jordan', 'Age': 35}
The script very simply take a query on name and age, searches through the list of dictionaries and returns a new list containing the index of all matching dictionary entries.
However as you can see below, an insane amount of memory is consumed. I presume I am making a very naive mistake somewhere.
My code is as follows: (can also be viewed in the image if more readable).
import sys
# Firstly, we will create 35 million records in memory, all will be the same apart from one
def search(key, value, data, age):
print("Searching, please wait")
# Create list to store returned PKs
foundPKS = []
for index in range(0, len(data)):
if key in data[index] and 'Age' in data[index]:
if data[index][key] == value and data[index]['Age'] >= age:
foundPKS.append(index)
results = foundPKS
return results
def createdata():
# Let's create our list for storing our dictionaries
print("Creating database, please wait")
dictList = []
for index in range(0, 35000000):
# Define dictionary
record = {'Name': 'Jordan', 'Age': 25}
if 24500123 <= index <= 24500200:
record['Name'] = 'Chris'
record['Age'] = 33
# Add the dict to a list
dictList.append(record)
return dictList
datareturned = createdata()
keyname = input("For which key do you wish to search?")
valuename = input("Which values do you want to find?")
valueage = input("What is the minimum age?")
print("Full data set object size:" + str(sys.getsizeof(datareturned)))
results = search(keyname, valuename, datareturned, int(valueage))
if len(results) > 0:
print(str(len(results)) + " found. Writing to results.txt")
fo = open("results.txt", "w")
for line in range(0, len(results)):
fo.write(str(results[line]) + "\n")
fo.close()
What is causing the massive consumption of RAM?
The overhead for a dict
object is quite large. It depends on your Python version and your system architechture, but on Python 3.5 64bit
In [21]: sys.getsizeof({})
Out[21]: 288
So guesstimating:
250*36e6*1e-9 == 9.0
So that is a lower limit on my ram usage in gigabytes if I created that many dictionaries, not factoring in the list
!
Rather than use a dict as a record type, which isn't really the use case, use a namedtuple
.
And to get a view of how this compares, let's set up an equivalent list of tuples:
In [23]: Record = namedtuple("Record", "name age")
In [24]: records = [Record("john", 28) for _ in range(36000000)]
In [25]: getsizeof = sys.getsizeof
Consider:
In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age) for record in records)
Out[31]: 5220000000
In [32]: _ + getsizeof(records)
Out[32]: 5517842208
In [33]: _ * 1e-9
Out[33]: 5.517842208
So 5 gigs is an upper limit that is quite conservative. For example, it assumes that there is no small-int caching going on, which for a record-type of ages will totally matter. On my own system, the python process is registering 2.7 gigs of memory usage (via top
).
So, what is actually going on in my machine is better modeled by being conservative for strings assuming -- unique strings that have an average size of 10, so no string interning -- but liberal for ints, assuming int-caching is taking care of our int
objects for us, so we just have to worry about the 8-byte pointers!
In [35]: sum(getsizeof("0123456789") + 8 for record in records)
Out[35]: 2412000000
In [36]: _ + getsizeof(records)
Out[36]: 2709842208
In [37]: _ * 1e-9
Out[37]: 2.709842208
Which is a good model for what I'm observing from top
.
Now, if you really want to cram data into ram, you are going to have to lose the flexibility of Python. You could use the array
module in combination with struct
, to get C-like memory efficiency. An easier world to wade into might be numpy
instead, which allows for similar things. For example:
In [1]: import numpy as np
In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])
In [3]: records = np.empty((36000000), dtype=recordtype)
In [4]: records.nbytes
Out[4]: 756000000
In [5]: records.nbytes*1e-9
Out[5]: 0.756
Note, we are now allowed to be quite compact. I can use 8-bit unsigned integers (i.e. a single byte) to represent age. However, immediately I am faced with some inflexibility: if I want efficient storage of strings I must define a maximum size. I've used 'S20'
, which is 20 characters. These are ASCII bytes, but a field of 20 ascii characters might very well suffice for names.
Now, numpy
gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be string of digits from a simple count, and age will be selected from a normal distribution with a mean of 50 and a standard deviation of 10.
In [8]: for i in range(1, 36000000+1):
...: records['name'][i - 1] = b"%08d" % i
...:
In [9]: import random
...: for i in range(36000000):
...: records['age'][i] = max(0, int(random.normalvariate(50, 10)))
...:
Now, we can use numpy to query our records
. For example, if you want the indices of your records given some condition, use np.where
:
In [10]: np.where(records['age'] > 70)
Out[10]: (array([ 58, 146, 192, ..., 35999635, 35999768, 35999927]),)
In [11]: idx = np.where(records['age'] > 70)[0]
In [12]: len(idx)
Out[12]: 643403
So 643403
records that have an age > 70
. Now, let's try 100
:
In [13]: idx = np.where(records['age'] > 100)[0]
In [14]: len(idx)
Out[14]: 9
In [15]: idx
Out[15]:
array([ 2315458, 5088296, 5161049, 7079762, 15574072, 17995993,
25665975, 26724665, 28322943])
In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
(b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
(b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
dtype=[('name', 'S20'), ('age', 'u1')])
Of course, one major inflexibility is that numpy
arrays are sized. Resizing operations are expensive. Now, you could maybe wrap a numpy.array
in some class and it will act as an efficient backbone, but at that point, you might as well use a real data-base. Lucky for you, Python comes with sqlite
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With