Memory consumption of python lists of lists

Tags: python, memory

Some code I was recently working on turned out to be using around 200MB of memory to run, and I'm stumped as to why it would need that much.

Basically, it mapped a text file onto a list in which each character of the file became its own two-item list: the character itself and the number of times it had appeared so far (starting from zero).

So 'abbac...' would be [['a','0'],['b','0'],['b','1'],['a','1'],['c','0'],...]

For a text file 1 million characters long, it used 200MB.

Is this reasonable or was it something else my code was doing? If it is reasonable, was it because of the high number of lists? Would [a,0,b,0,b,1,a,1,c,0...] take up substantially less space?
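Roughly, the construction looks like this (a simplified sketch, with the running counts stored as strings as in the example above), along with a back-of-the-envelope size estimate:

import sys
from collections import defaultdict

def char_counts(text):
    # Build [[char, count_so_far], ...] with the running count
    # stored as a string, matching the structure described above
    seen = defaultdict(int)
    result = []
    for ch in text:
        result.append([ch, str(seen[ch])])
        seen[ch] += 1
    return result

data = char_counts('abbac' * 200000)  # 1 million characters

# Rough lower bound: the outer list plus one small inner list per
# character (this ignores the string objects themselves)
approx = sys.getsizeof(data) + len(data) * sys.getsizeof(data[0])
print(approx / 1e6, "MB")  # roughly 80 MB on 64-bit CPython, before the strings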

asked Oct 19 '22 by user2998454

1 Answer

If you do not need the list itself, then I fully subscribe to @Lattyware's solution of using a generator.
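For a rough idea of that approach, a generator version might look something like this (my sketch, not necessarily @Lattyware's exact code):

from collections import defaultdict

def char_counts(s):
    # Yield (char, count_so_far) pairs lazily instead of
    # materialising a million small lists in memory
    seen = defaultdict(int)
    for ch in s:
        yield ch, seen[ch]
        seen[ch] += 1

for ch, count in char_counts('abbac'):
    print(ch, count)  # a 0, b 0, b 1, a 1, c 0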

However, if that's not an option then perhaps you could compress the data in your list without loss of information by storing only the positions for each character in the file.

import random
import string

def track_char(s):
    # Make sure all characters have the same case
    s = s.lower()
    # Map each distinct character to the list of positions where it occurs
    d = dict((k, []) for k in set(s))
    for position, char in enumerate(s):
        d[char].append(position)
    return d

st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))
d = track_char(st)

len(d["a"])  # total number of occurrences of 'a' in the string

# Which character occurs at position 2, and how many times does it occur in total?
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences" % (char, len(d[char])))
Character c has 1878 occurrences

# How many times has that character occurred up to (and including) position 2?
for char, vals in d.items():
    if 2 in vals:
        print("Character %s has %s occurrences so far" % (char, len([x for x in d[char] if x <= 2])))
Character c has 1 occurrences so far

This way there is no need to duplicate the character each time it occurs, and you keep the information about all of its occurrences.
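As a side note, each position list is built in increasing order, so the linear scan above can be replaced by a binary search; a minimal sketch using the standard bisect module:

import bisect

def occurrences_so_far(d, char, position):
    # The position lists are sorted, so bisect_right counts how many
    # recorded positions are <= position in O(log n)
    return bisect.bisect_right(d[char], position)

occurrences_so_far(d, "c", 2)  # same result as the loop above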

To compare the object sizes of your original list and this approach, here's a test:

import random
import string
from sys import getsizeof

# random generation of a string with 50k characters
st = ''.join(random.choice(string.ascii_uppercase) for _ in range(50000))

# Function that returns a list shaped like the original one
# (position stands in for the running count here; the list has
# the same structure either way, so the size comparison holds)
def original_track(s):
    l = []
    for position, char in enumerate(s):
        l.append([char, position])
    return l

# Testing sizes
original_list = original_track(st)
dict_format = track_char(st)

getsizeof(original_list)
406496
getsizeof(dict_format)
1632

As you can see, dict_format is roughly 250 times smaller, and the difference should be even more pronounced for larger strings. Note, however, that sys.getsizeof is shallow: it measures only the container itself, not the nested lists, strings, or integers, so it understates the true footprint of both objects.
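If you want a measurement that includes the nested objects, a recursive sketch along these lines gives a fairer comparison:

from sys import getsizeof

def total_size(obj, seen=None):
    # Recursively sum getsizeof over an object and everything it
    # contains, counting each distinct object only once
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(total_size(item, seen) for item in obj)
    return size

total_size(original_list)
total_size(dict_format)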

answered Oct 22 '22 by ODiogoSilva