 

PyTables vs. CSV for files that are not very large

I recently came across PyTables and find it very cool. It is clearly superior to the CSV format for very large data sets, but I am running some simulations in Python whose output is not that large: say, 200 columns and 2000 rows.

If you have experience with both, can you suggest which format is more convenient in the long run for data sets of this size? PyTables has data manipulation capabilities, and the data can be browsed with ViTables, but that browser does not have as much functionality as, say, Excel, which can be used with CSV. Similarly, do you find one better than the other for importing and exporting data when working mainly in Python? Is one more convenient in terms of file organization? Any comments on issues such as these would be helpful.

Thanks.

asked Oct 26 '10 by Curious2learn




3 Answers

Have you considered NumPy arrays?

PyTables is wonderful when your data is too large to fit in memory, but a 200x2000 matrix of 8-byte floats only requires about 3 MB of memory. So I think PyTables may be overkill.

You can save NumPy arrays to files with np.savetxt (text) or np.save (binary), and read them back with np.loadtxt or np.load; np.savez bundles several arrays into a single .npz file, and np.savez_compressed does the same with compression.
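For instance, a minimal round trip with both formats (file names are arbitrary):

import numpy as np

data = np.random.random((2000, 200))  # simulation output of the size described

# Text format: human readable, one row per line
np.savetxt('results.txt', data)
data_txt = np.loadtxt('results.txt')

# Binary format: several named arrays in one .npz file
np.savez('results.npz', run1=data)
data_bin = np.load('results.npz')['run1']

assert np.allclose(data, data_txt) and np.array_equal(data, data_bin)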

If you have many such arrays to store on disk, then I'd suggest using a database instead of numpy .npz files. By the way, to store a 200x2000 matrix in a database, you only need 3 table columns: row, col, value:

import sqlite3
import numpy as np

db = sqlite3.connect(':memory:')
cursor = db.cursor()
cursor.execute('''CREATE TABLE foo
                  (row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (row,col))''')
ROWS=4
COLUMNS=6
matrix = np.random.random((ROWS,COLUMNS))
print(matrix)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

# Store matrix in table foo
cursor.executemany('INSERT INTO foo(row, col, value) VALUES (?,?,?) ',
                   ((r,c,value) for r,row in enumerate(matrix) 
                                for c,value in enumerate(row)))

# Retrieve matrix from table foo
cursor.execute('SELECT value FROM foo ORDER BY row,col')
data = [value for (value,) in cursor.fetchall()]
matrix2 = np.fromiter(data, dtype=float).reshape((ROWS, COLUMNS))
print(matrix2)
# [[ 0.87050721  0.22395398  0.19473001  0.14597821  0.02363803  0.20299432]
#  [ 0.11744885  0.61332597  0.19860043  0.91995295  0.84857095  0.53863863]
#  [ 0.80123759  0.52689885  0.05861043  0.71784406  0.20222138  0.63094807]
#  [ 0.01309897  0.45391578  0.04950273  0.93040381  0.41150517  0.66263562]]

If you have many such 200x2000 matrices, you just need one more table column to specify which matrix.
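As a sketch, continuing the sqlite3 session above (the runs table and the helper functions here are illustrative, not part of any library):

# Hypothetical schema: one extra column identifies the matrix
cursor.execute('''CREATE TABLE runs
                  (matrix_id INTEGER,
                   row INTEGER,
                   col INTEGER,
                   value FLOAT,
                   PRIMARY KEY (matrix_id, row, col))''')

def store(matrix_id, matrix):
    cursor.executemany(
        'INSERT INTO runs(matrix_id, row, col, value) VALUES (?,?,?,?)',
        ((matrix_id, r, c, v) for r, row in enumerate(matrix)
                              for c, v in enumerate(row)))

def load(matrix_id, shape):
    cursor.execute('SELECT value FROM runs WHERE matrix_id=? ORDER BY row, col',
                   (matrix_id,))
    return np.fromiter((v for (v,) in cursor.fetchall()),
                       dtype=float).reshape(shape)

store(0, matrix)
assert np.allclose(load(0, matrix.shape), matrix)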

answered Sep 25 '22 by unutbu


As far as importing/exporting goes, PyTables uses a standardized file format called HDF5. Many scientific software packages (like MATLAB) have built-in support for HDF5, and the C API isn't terrible. So any data you need to export from or import to one of these languages can simply be kept in HDF5 files.
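For example, a minimal sketch of that workflow using the modern PyTables API (file and node names are arbitrary):

import numpy as np
import tables

results = np.random.random((2000, 200))

# Write a plain HDF5 dataset; MATLAB, h5py, C, etc. can read this file
with tables.open_file('simulation.h5', mode='w') as h5:
    h5.create_array('/', 'results', results, title='simulation output')

# Read it back
with tables.open_file('simulation.h5', mode='r') as h5:
    restored = h5.root.results.read()

assert np.array_equal(results, restored)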

PyTables does add some attributes of its own, but these shouldn't hurt you. Of course, if you store Python objects in the file, you won't be able to read them elsewhere.

The one nice thing about CSV files is that they're human readable. However, if you need to store anything other than simple numbers in them and exchange them with others, you'll have issues. I receive CSV files from people in other organizations, and I've noticed that humans aren't good at getting things like string quoting right. It's good that Python's CSV parser is as flexible as it is. One other issue is that floating-point numbers lose precision when stored as decimal text, unless you print enough digits. That's usually good enough, though.
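If you generate the CSV yourself, Python's csv module can at least enforce consistent quoting; a small sketch (the data here is made up):

import csv

rows = [(1, 'plain'), (2, 'contains, a comma'), (3, 'contains "quotes"')]

# QUOTE_NONNUMERIC quotes every string field, so commas and quotes survive
with open('data.csv', 'w', newline='') as f:
    csv.writer(f, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)

with open('data.csv', newline='') as f:
    restored = [tuple(row) for row in csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)]

# Note: on read, QUOTE_NONNUMERIC converts unquoted fields to float
print(restored)  # [(1.0, 'plain'), (2.0, 'contains, a comma'), ...]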

answered Sep 24 '22 by kwatford


One big plus for PyTables is the storage of metadata, such as simulation parameters. If you run the simulations repeatedly with different parameters, you can store each result as a separate array entry in the same .h5 file, with the parameters attached as attributes.

We use it to store measurement data together with the experiment scripts that produced it, so everything is self-contained.
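A sketch of what that looks like (the parameter names and file layout here are made up):

import numpy as np
import tables

with tables.open_file('runs.h5', mode='w') as h5:
    for i, dt in enumerate([0.1, 0.01]):        # hypothetical parameter sweep
        result = np.random.random((2000, 200))  # stand-in for simulation output
        arr = h5.create_array('/', 'run%d' % i, result)
        arr.attrs.dt = dt                       # parameter travels with the data
        arr.attrs.script = 'simulate.py v1.2'   # record what produced it

with tables.open_file('runs.h5', mode='r') as h5:
    for node in h5.root:
        print(node.name, node.attrs.dt)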

BTW: if you need a quick look inside an HDF5 file, you can use HDFView, a free Java application from the HDF Group. It's easy to install.

answered Sep 22 '22 by mrossi