 

Data size in memory vs. on disk

How does the RAM required to store data in memory compare to the disk space required to store the same data in a file? Or is there no generalized correlation?

For example, say I simply have a billion floating point values. Stored in binary form, that'd be 4 billion bytes or 3.7GB on disk (not including headers and such). Then say I read those values into a list in Python... how much RAM should I expect that to require?
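
As a sanity check on the on-disk arithmetic, here is a minimal sketch using the standard array module (the filename and the scaled-down count are just examples):

import array
import os

# A million single-precision (4-byte) floats, scaled down from the billion in the question
values = array.array('f', (0.5 for _ in range(1_000_000)))

with open('floats.bin', 'wb') as fh:
    values.tofile(fh)

print(os.path.getsize('floats.bin'))  # 4000000 bytes: exactly 4 bytes per value, no headers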

asked Apr 10 '14 by Aaron Leinmiller



2 Answers

Python Object Data Size

If the data is stored in Python objects, some extra memory is attached to the actual values, and the amount of overhead depends on the container type. This is easy to test.

[Plot: The size of data in various forms — memory size in bytes vs. number of doubles, for raw bytes, array, string, list, set, and tuple]

It is interesting to note how the overhead of the Python object is significant for small data but quickly becomes negligible.

Here is the IPython code used to generate the plot:

%matplotlib inline
import array
import random
import sys
import matplotlib.pyplot as plt

max_doubles = 10000

raw_size = []
array_size = []
string_size = []
list_size = []
set_size = []
tuple_size = []
size_range = range(max_doubles)

# Build n random doubles and measure the footprint of each container type.
# Note: sys.getsizeof is shallow -- for list, set, and tuple it counts the
# container and its element pointers, not the float objects behind them.
for n in size_range:
    double_array = array.array('d', [random.random() for _ in range(n)])
    double_string = double_array.tobytes()  # raw binary form (tobytes() replaces Python 2's tostring())
    double_list = double_array.tolist()
    double_set = set(double_list)
    double_tuple = tuple(double_list)

    raw_size.append(double_array.buffer_info()[1] * double_array.itemsize)
    array_size.append(sys.getsizeof(double_array))
    string_size.append(sys.getsizeof(double_string))
    list_size.append(sys.getsizeof(double_list))
    set_size.append(sys.getsizeof(double_set))
    tuple_size.append(sys.getsizeof(double_tuple))

# display
plt.figure(figsize=(10, 8))
plt.title('The size of data in various forms', fontsize=20)
plt.xlabel('Data Size (double, 8 bytes)', fontsize=15)
plt.ylabel('Memory Size (bytes)', fontsize=15)
plt.loglog(
    size_range, raw_size,
    size_range, array_size,
    size_range, string_size,
    size_range, list_size,
    size_range, set_size,
    size_range, tuple_size
)
plt.legend(['Raw (Disk)', 'Array', 'String', 'List', 'Set', 'Tuple'], fontsize=15, loc='best')
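
Because sys.getsizeof is shallow, the list, set, and tuple curves above count only the containers and their element pointers, not the 24-byte float objects behind them. A quick sketch of shallow vs. deep per-element cost (the numbers assume 64-bit CPython and vary slightly with over-allocation):

import sys

xs = [float(i) for i in range(1000)]

shallow = sys.getsizeof(xs) / len(xs)                               # pointer array + over-allocation
deep = (sys.getsizeof(xs) + sum(map(sys.getsizeof, xs))) / len(xs)  # add the float objects themselves

print(shallow)  # roughly 8-9 bytes per slot
print(deep)     # roughly 32-33 bytes per double in total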
answered by tmthydvnprt


In a plain Python list, every double-precision number requires at least 32 bytes of memory, but only 8 bytes are used to store the actual number; the rest is needed to support the dynamic nature of Python.

The float object used in CPython is defined in floatobject.h:

typedef struct {
    PyObject_HEAD
    double ob_fval;
} PyFloatObject;

where PyObject_HEAD is a macro that expands to the PyObject struct:

typedef struct _object {
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

Therefore, every floating-point object in Python stores two pointer-sized fields (8 bytes each on a 64-bit architecture) in addition to the 8-byte double, giving 24 bytes of heap-allocated memory per number. This is confirmed by sys.getsizeof(1.0) == 24.
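
The same arithmetic can be reproduced from Python itself; here is a minimal check using ctypes for the field sizes (assuming a 64-bit build, where Py_ssize_t, a pointer, and a double are all 8 bytes):

import ctypes
import sys

# ob_refcnt (Py_ssize_t) + ob_type (pointer) + ob_fval (double)
struct_size = (ctypes.sizeof(ctypes.c_ssize_t)
               + ctypes.sizeof(ctypes.c_void_p)
               + ctypes.sizeof(ctypes.c_double))

print(struct_size)         # 24 on a 64-bit build
print(sys.getsizeof(1.0))  # 24 -- matches the struct layout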

This means that a list of n doubles in Python takes at least 8*n bytes of memory just to store the pointers (PyObject*) to the number objects, and each number object requires an additional 24 bytes. To test this, try running the following lines in the Python REPL:

>>> import math
>>> list_of_doubles = [math.sin(x) for x in range(10*1000*1000)]

and see the memory usage of the Python interpreter (I got around 350 MB of allocated memory on my x86-64 computer). Note that if you tried:

>>> list_of_doubles = [1.0 for __ in range(10*1000*1000)]

you would see only about 80 MB, because all elements in the list refer to the same instance of the floating-point number 1.0.
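
Scaling this accounting up to the billion values from the original question (a back-of-envelope sketch, assuming 64-bit CPython and mostly distinct values):

n = 1_000_000_000

pointer_bytes = 8 * n   # one PyObject* per list slot
object_bytes = 24 * n   # one 24-byte PyFloatObject per distinct value

print((pointer_bytes + object_bytes) / 2**30)  # ~29.8 GiB, before any list over-allocation
# versus ~3.7 GiB on disk as raw 4-byte floats, or ~7.5 GiB as raw 8-byte doubles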

answered by Jan Špaček