Consider the following code:
arr = []
for (str, id, flag) in some_data:
    arr.append((str, id, flag))
Imagine the input strings are 2 characters long on average and 5 characters at most, and some_data has 1 million elements. What will the memory requirement of such a structure be?
Could a lot of memory be wasted on the strings? If so, how can I avoid that?
A string in Python is a sequence of Unicode characters, written between single, double, or triple quotes. Every string is a full object, however, so in addition to its characters it carries a fixed per-object overhead, and for very short strings that overhead dominates the memory cost.
For comparison, an empty java.lang.String in Java takes 40 bytes, enough memory to fit 20 Java characters.
In Python, sys.getsizeof() reports how much memory a particular object occupies. It returns the size in bytes, counting only the object itself and not any objects it refers to.
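As a rough illustration of that point, here is a minimal sketch that measures one record both ways. The helper name shallow_and_total_size and the sample row are made up for this example, and the exact numbers depend on the Python version and build.

```python
import sys

def shallow_and_total_size(record):
    """Return (size of the tuple alone, size of the tuple plus its elements)."""
    shallow = sys.getsizeof(record)  # the tuple object only
    # Add each element's own size. Objects shared between records
    # (small ints, True/False, interned strings) get counted once per
    # record here, so summing this over a whole list is an upper bound.
    total = shallow + sum(sys.getsizeof(item) for item in record)
    return shallow, total

record = ("ab", 12345, True)  # hypothetical sample row
shallow, total = shallow_and_total_size(record)
print(shallow, total)
```

The gap between the two numbers is the memory held by the elements, which sys.getsizeof() on the tuple alone does not report.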
In this case, because the strings are so short and there are so many of them, you stand to save a fair bit of memory by calling intern (sys.intern in Python 3) on the strings. Assuming they contain only lowercase letters, there are just 26 * 26 = 676 possible two-character strings, so there must be a lot of repetition in this list; intern ensures that those repeats don't become separate objects but all refer to the same underlying object.
It's possible that Python already interns short strings, but judging by a number of different sources this is highly implementation-dependent. So calling intern explicitly is probably the way to go in this case; YMMV.
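A minimal sketch of what that could look like for the loop in the question, assuming Python 3 (where the function lives at sys.intern). The function build_records and the sample data are hypothetical; the tuple fields are renamed only to avoid shadowing the built-ins str and id.

```python
import sys

def build_records(some_data):
    """Copy (text, ident, flag) rows, interning each string field."""
    records = []
    for text, ident, flag in some_data:
        # sys.intern returns one canonical object per distinct string value
        records.append((sys.intern(text), ident, flag))
    return records

# Hypothetical input: the strings are built at runtime, so equal values
# may start out as separate objects.
sample = [("".join(["a", "b"]), i, True) for i in range(1000)]
records = build_records(sample)

# Count how many distinct string objects the records refer to.
unique_string_objects = len({id(rec[0]) for rec in records})
print(unique_string_objects)  # prints 1
```

Interning only pays off if the original, non-interned strings are released afterwards, so it is best applied while the data is being read in.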
As an elaboration on why this is very likely to save memory, consider the following:
>>> import sys
>>> sys.getsizeof('')
40
>>> sys.getsizeof('a')
41
>>> sys.getsizeof('ab')
42
>>> sys.getsizeof('abc')
43
Each additional character adds only one byte to the size of the string, but every string object carries a fixed overhead of 40 bytes on its own. The exact figures vary between Python versions and builds.
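To put numbers on this, here is a back-of-the-envelope estimate. It assumes the 40-byte overhead from the session above, that every string is exactly 2 lowercase letters, and that without interning each of the 1 million strings is a separate object; the list and tuple overhead is not included. All constant names are just for this example.

```python
PER_STRING_OVERHEAD = 40   # bytes, taken from sys.getsizeof('') above
STRING_LENGTH = 2          # assumed: every string is 2 characters
NUM_RECORDS = 1_000_000

bytes_per_string = PER_STRING_OVERHEAD + STRING_LENGTH  # 42

# Worst case: one separate string object per record.
without_interning = NUM_RECORDS * bytes_per_string      # 42,000,000 bytes

# With interning: at most one object per distinct string.
distinct_strings = 26 * 26                               # 676
with_interning = distinct_strings * bytes_per_string     # 28,392 bytes

print(without_interning, with_interning)
```

Under these assumptions the strings alone drop from roughly 42 MB to under 30 KB; the real saving depends on how many duplicates the data actually contains.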
If your strings are this short, it is likely there will be a significant number of duplicates. Interning optimises this so that each distinct string is stored only once and the reference is used multiple times, rather than the string being stored multiple times.
These strings should be automatically interned as they are.