PyTables: dealing with data many times larger than the available memory

I'm trying to understand how PyTables manages data whose size is greater than the available memory. Here is a comment from the PyTables source code (link to GitHub):

# Nodes referenced by a variable are kept in `_aliveNodes`.
# When they are no longer referenced, they move themselves
# to `_deadNodes`, where they are kept until they are referenced again
# or they are preempted from it by other unreferenced nodes.

Useful comments can also be found inside the _getNode method.
It seems PyTables has a fairly smart I/O buffering system which, as I understand it, keeps data currently referenced by the user in fast RAM as "alive nodes", keeps previously referenced but currently unreferenced data as "dead nodes" so it can be "revived" quickly when needed, and reads data from disk when the requested key is present in neither category.

I would like some expert insight into how exactly PyTables handles data larger than the available memory. My specific questions:

  1. How does the dead node/alive node system work (the big picture)?
  2. What is the key difference between alive nodes and dead nodes, given that (if I'm right) both represent data stored in RAM?
  3. Can the amount of RAM used for buffering be adjusted manually? Below the comment there is code which reads a value from params['NODE_CACHE_SLOTS']. Can this be specified by the user somehow? For example, if I want to leave some RAM for other applications that also need memory?
  4. In what situations can PyTables crash or slow down significantly when working with a large amount of data? In my case the data can exceed memory by a factor of 100; what are the common pitfalls in such situations?
  5. What usage of PyTables, in terms of data size and structure, and also in terms of how the data is manipulated, is considered 'right' for achieving the best performance?
  6. The docs suggest using .flush() after each basic .append() cycle. How long can this cycle actually be? I'm running a little benchmark comparing SQLite and PyTables on creating a huge table of key-value pairs from big CSV files, and when I call .flush() less frequently in the main loop, PyTables gains a huge speedup. So is it correct to .append() relatively big chunks of data and then call .flush(), roughly in the pattern sketched below?
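
Roughly, the pattern I have in mind is the following (a simplified sketch rather than my actual benchmark code; the file names and column sizes are made up):

import csv
import tables

CHUNK_ROWS = 100_000   # rows to accumulate before each .append()/.flush()

# Simple key-value table description; the column sizes are arbitrary here.
description = {
    'key': tables.StringCol(64, pos=0),
    'value': tables.Float64Col(pos=1),
}

with tables.open_file('kv.h5', mode='w') as h5:
    table = h5.create_table('/', 'kv', description)
    buf = []
    with open('big.csv', newline='') as f:
        for key, value in csv.reader(f):   # assumes a two-column CSV
            buf.append((key.encode('utf-8'), float(value)))
            if len(buf) >= CHUNK_ROWS:
                table.append(buf)   # append a relatively big chunk...
                table.flush()       # ...then flush once per chunk
                buf = []
    if buf:
        table.append(buf)           # don't forget the last partial chunk
        table.flush()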
asked Feb 20 '13 by Gill Bates

2 Answers

Memory Structure

I've never used PyTables, but looking at the source code:

class _Deadnodes(lrucacheExtension.NodeCache):
    pass

So it looks like _deadnodes is implemented using an LRU cache. LRU stands for "Least Recently Used", which means that the least recently used node is thrown away first. The source is here.

class _AliveNodes(dict):
    ...

This is used as a customized dictionary of the nodes that are currently live, i.e. actually referenced in the program.
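
To make the LRU idea concrete, here is a toy sketch in plain Python (this is not PyTables code; the real cache is a Cython extension and tracks more state):

from collections import OrderedDict

class ToyDeadNodeCache:
    """Toy LRU cache illustrating the idea behind _deadnodes."""

    def __init__(self, nslots=64):
        self.nslots = nslots
        self._slots = OrderedDict()   # key -> node, oldest entries first

    def put(self, key, node):
        # Re-inserting a key makes it the most recently used entry.
        self._slots.pop(key, None)
        self._slots[key] = node
        if len(self._slots) > self.nslots:
            # Evict the least recently used entry.
            self._slots.popitem(last=False)

    def get(self, key, default=None):
        node = self._slots.pop(key, None)
        if node is None:
            return default        # miss: the caller has to load from disk
        self._slots[key] = node   # hit: "revive" it and mark as most recent
        return node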

A very simplified example (nodes are letters, the numbers next to cached entries indicate how stale an entry is):

memory holds 4 nodes, access takes 1 time step
cache holds 2 nodes, access takes 5 time steps
disk holds much, much more, access takes 50 time steps

get node A //memory,cache miss load from disk t=50
get node B // "" t=100
get node C // "" t=150
get node D // "" t=200
get node E // "" t=250
get node A //cache hit load from cache t=255
get node F //memory, cache miss load from disk t=305
get node G //memory, cache miss load from disk t=355
get node E // in memory t=356 (everything stays the same)

t=200              t=250              t=255
Memory    CACHE    Memory    CACHE    Memory    CACHE
A                  E         A0       E         B0
B                  B                  A
C                  C                  C
D                  D                  D

t=305              t=355
Memory    CACHE    Memory    CACHE
E         B1       E         C1
A         C0       A         D0
F                  F
D                  G

As you know, in real life these structures are huge, and the time it takes to access them is measured in bus cycles, i.e. on the order of 1/(the clock rate of your PC).

The relative times to access the different levels follow the same pattern: pretty much negligible for memory, a little more for the cache, and a whole lot more for disk. Reading from disk is the longest part of the whole process: the platter and the arm need to move, and so on. It is a physical process rather than an electronic one, i.e. it is not happening at the speed of light.

PyTables does something similar. It has its own cache algorithm written in Cython that acts as a middleman between the alive nodes (memory) and the full data (disk). If the hit ratio is too low, it looks like the cache gets turned off, and after a certain number of cycles it is turned back on again.

In parameters.py, the DISABLE_EVERY_CYCLES, ENABLE_EVERY_CYCLES and LOWEST_HIT_RATIO variables define how many cycles with a hit ratio below LOWEST_HIT_RATIO must pass before the cache is disabled, and how many cycles to wait before it is re-enabled. Changing these values is discouraged.

The main thing you should take from this is that if you need to process a big dataset, make sure the pieces you work on together live in the same nodes. If you can get away with it, read in a chunk, do your processing on that chunk, get your results, then load another chunk. If you load chunk A, then chunk B, then chunk A again, that causes the most delay. Only operate on one chunk of data at a time and keep accesses and writes to a minimum. Once a value is in _alivenodes it is quick to modify; _deadnodes is a bit slower; and a node that is in neither is a lot slower.
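
For instance, a chunk-at-a-time pass over a table might look roughly like this (an illustrative sketch; the file name 'data.h5', the node '/mytable' and the column 'value' are made up):

import tables

CHUNK = 1_000_000   # rows per chunk; tune this to the RAM you can spare

with tables.open_file('data.h5', mode='r') as h5:
    table = h5.root.mytable
    total = 0.0
    for start in range(0, table.nrows, CHUNK):
        # One contiguous read per chunk...
        chunk = table.read(start=start, stop=min(start + CHUNK, table.nrows))
        # ...and all the processing for that chunk before moving on,
        # so the same data never has to be pulled from disk twice.
        total += chunk['value'].sum()
print(total)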

NODE_CACHE_SLOTS

params['NODE_CACHE_SLOTS'] defines the size of the set of dead nodes. Tracing it back to parameters.py, it defaults to 64. The comments there state that you can try out different values and report back. You could either change the value in that file or do:

from tables import parameters
parameters.NODE_CACHE_SLOTS = 128   # or whatever value you want to try; set this before opening the file

This only limits the number of nodes kept in the cache. Past that, you are limited by Python's heap size; to adjust that, see this.
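
If I'm reading the docs right, you should also be able to override it per file, since open_file() accepts the names of parameters from tables/parameters.py as extra keyword arguments (worth double-checking against your PyTables version; 'data.h5' and '/mytable' are placeholder names):

import tables

# Ask for a bigger node cache for this file only, presumably overriding the
# module-level NODE_CACHE_SLOTS default just for this handle.
with tables.open_file('data.h5', mode='r', NODE_CACHE_SLOTS=1024) as h5:
    table = h5.root.mytable
    print(table.nrows)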

append/flush

For append, flush ensures the rows are actually written out to the table. The more data you move this way, the longer it takes for it to go from the internal buffer to the on-disk data structure. Under the hood it calls a modified version of HDF5's H5TBwrite_records function, along with other handling code. I'm guessing the length of that call determines how long each output cycle takes.

Keep in mind that this is all from reading the source code, without considering any additional magic they may be doing; I have never used PyTables. In theory it shouldn't crash, but we don't live in a theoretical world.

Edit:

Actually, finding a need for PyTables myself, I have come across this question in their FAQ that might answer some of your concerns.

Thank you for exposing PyTables to me; if I had come across .h5 files before researching this question, I wouldn't have known what to do.

answered by Raufio


I'm not an expert on PyTables1, but it most likely works like swap memory.

The aliveNodes live in RAM, while the deadNodes are probably stored on disk in HDF5 files (the binary file format used by PyTables). Every time you need to access a piece of data, it needs to be in RAM. So PyTables checks whether it is already there (an aliveNode) and returns it to you if it is. Otherwise, it needs to revive the deadNode where the data lives. Since RAM is limited, it will probably kill (write to disk) an unused aliveNode beforehand to make some room.

The reason for this process is of course the limited size of the RAM. The consequence is that performance is affected every time you need to swap a node (kill one node and revive another).

To optimize performance, you should try to minimize swapping. For instance, if your data can be processed in parallel, you may be able to load each node only once. Another example: imagine that you need to loop over every element of a huge matrix which is split into a grid of nodes. Then you had better avoid accessing its elements row by row or column by column, and instead process it node by node.

Of course PyTables handles all this under the hood, so you don't necessarily have control over what is in each node (but I encourage you to dig around the NODE_CACHE_SLOTS variable, at least to understand how it works). But generally it's faster to access data that is contiguous rather than scattered all over the place. As always, if time performance is an important issue for your application(s), profile your code.
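
As a toy illustration of the difference (made-up file and node names, not a benchmark):

import tables

with tables.open_file('data.h5', mode='r') as h5:
    table = h5.root.mytable

    # Contiguous: one big sequential read, friendly to the node/chunk cache.
    block = table.read(start=0, stop=100_000)

    # Scattered: many small random reads; each one may have to go back to disk.
    rows = [table[i] for i in (3, 70_000, 12, 99_999, 400)]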


1 Translation: I hardly know anything about PyTables

answered by Simon