How does the RAM required to store data in memory compare to the disk space required to store the same data in a file? Or is there no generalized correlation?
For example, say I simply have a billion floating point values. Stored in binary form, that would be 4 billion bytes, or about 3.7 GiB, on disk (not including headers and such). Then say I read those values into a list in Python... how much RAM should I expect that to require?
"Size" is the actual size of the file's contents in bytes; "size on disk" is the amount of space the file actually occupies on the disk, which is rounded up to whole allocation units.
The term "memory" usually means RAM (Random Access Memory); RAM is hardware that allows the computer to efficiently perform more than one task at a time (i.e., multi-task). The terms "disk space" and "storage" usually refer to hard drive storage.
By adding more memory, your slow computer that struggles to perform multiple tasks at once will experience faster recall speeds. Upgrading your storage is the best solution if your computer still has an HDD, as most computers now come with an SSD due to the clear performance benefits.
Which is more important: storage or memory? Storage and memory are both important for your computer. If you have a disk with larger storage capacity, you can store more files and programs on your computer. And with more RAM, your computer can manipulate larger digital data and run faster.
If the data is stored in some Python object, there will be a little more data attached to the actual data in memory.
This may be easily tested.
It is interesting to note how, at first, the overhead of the Python object is significant for small amounts of data, but quickly becomes negligible.
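As a quick sanity check before generating the plot (a minimal sketch using only the standard library; the exact byte counts depend on the CPython version and platform), you can compare the raw payload of n doubles against what sys.getsizeof reports for the container holding them:
import array
import sys

for n in (1, 10, 1000, 100000):
    doubles = array.array('d', [0.0] * n)   # n doubles stored contiguously
    raw = n * doubles.itemsize               # payload only: 8 bytes per double
    total = sys.getsizeof(doubles)           # payload plus the array object's header
    print(f"n={n}: raw={raw} B, in memory={total} B, overhead={total - raw} B")
For a single double the container's fixed header dominates; for hundreds of thousands of doubles it becomes negligible, which is the behaviour the plot below shows.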
Here is the IPython code used to generate the plot:
%matplotlib inline
import array
import random
import sys
import matplotlib.pyplot as plt

max_doubles = 10000

raw_size = []     # n doubles at 8 bytes each -- what a binary file on disk would hold
array_size = []   # array.array('d', ...)
string_size = []  # the same doubles packed into a bytes object
list_size = []    # list of float objects (container only, as reported by sys.getsizeof)
set_size = []     # set of float objects (container only)
tuple_size = []   # tuple of float objects (container only)

size_range = range(1, max_doubles)  # start at 1 so the log-log axes have no zero point

# Measure how much memory n doubles occupy in each container
for n in size_range:
    double_array = array.array('d', [random.random() for _ in range(n)])
    double_string = double_array.tobytes()  # tostring() on Python 2
    double_list = double_array.tolist()
    double_set = set(double_list)
    double_tuple = tuple(double_list)

    raw_size.append(double_array.buffer_info()[1] * double_array.itemsize)
    array_size.append(sys.getsizeof(double_array))
    string_size.append(sys.getsizeof(double_string))
    list_size.append(sys.getsizeof(double_list))
    set_size.append(sys.getsizeof(double_set))
    tuple_size.append(sys.getsizeof(double_tuple))

# Display the results on log-log axes
plt.figure(figsize=(10, 8))
plt.title('The size of data in various forms', fontsize=20)
plt.xlabel('Data Size (double, 8 bytes)', fontsize=15)
plt.ylabel('Memory Size (bytes)', fontsize=15)
plt.loglog(
    size_range, raw_size,
    size_range, array_size,
    size_range, string_size,
    size_range, list_size,
    size_range, set_size,
    size_range, tuple_size,
)
plt.legend(['Raw (Disk)', 'Array', 'String', 'List', 'Set', 'Tuple'],
           fontsize=15, loc='best')
In a plain Python list, every double-precision number requires at least 32 bytes of memory, but only 8 bytes are used to store the actual number; the rest is needed to support the dynamic nature of Python.
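A rough accounting of those 32 bytes (a sketch, assuming 64-bit CPython where each pointer is 8 bytes and sys.getsizeof(1.0) == 24; list over-allocation adds a little on top):
import sys

values = [float(i) for i in range(1000)]             # 1000 distinct float objects
pointer_bytes = sys.getsizeof(values)                 # the list: header plus an 8-byte pointer per element
float_bytes = sum(sys.getsizeof(v) for v in values)   # 24 bytes per float object
print(pointer_bytes, float_bytes)
print((pointer_bytes + float_bytes) / len(values))    # a little over 32 bytes per stored double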
The float object used in CPython is defined in floatobject.h:
typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;
where PyObject_HEAD is a macro that expands to the PyObject struct:
typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;
Therefore, every floating-point object in Python stores two pointer-sized fields (each taking 8 bytes on a 64-bit architecture) besides the 8-byte double, giving 24 bytes of heap-allocated memory per number. This is confirmed by sys.getsizeof(1.0) == 24.
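You can check this interactively (the 24-byte figure assumes a 64-bit build; on a 32-bit build the two header fields are 4 bytes each and the result is 16):
>>> import sys
>>> sys.getsizeof(1.0)        # 8 (ob_refcnt) + 8 (ob_type) + 8 (ob_fval)
24
>>> sys.getsizeof(12345.6789) # the value does not matter; every float object has the same size
24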
This means that a list of n doubles in Python takes at least 8*n bytes of memory just to store the pointers (PyObject*) to the number objects, and each number object requires an additional 24 bytes. To test it, try running the following lines in the Python REPL:
>>> import math
>>> list_of_doubles = [math.sin(x) for x in range(10*1000*1000)]
and see the memory usage of the Python interpreter (I got around 350 MB of allocated memory on my x86-64 computer). Note that if you tried:
>>> list_of_doubles = [1.0 for __ in range(10*1000*1000)]
you would obtain just about 80 MB, because all elements in the list refer to the same instance of the floating-point number 1.0.
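You can approximate both figures from inside the interpreter instead of watching it from the outside (a rough sketch: sys.getsizeof counts the list's pointer array and the float objects it references, but not memory-allocator overhead, so the totals come out a bit below what the operating system reports):
import math
import sys

n = 10 * 1000 * 1000

distinct = [math.sin(x) for x in range(n)]   # ten million different float objects
shared = [1.0 for _ in range(n)]             # ten million references to one float object

print(shared[0] is shared[1])                # True: the same 1.0 object everywhere
print(distinct[0] is distinct[1])            # False: separate objects

print(sys.getsizeof(distinct) + n * sys.getsizeof(0.0))   # pointers + 10M float objects, roughly 320 MB
print(sys.getsizeof(shared) + sys.getsizeof(1.0))         # pointers + one float object, roughly 80 MB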