PyTables vs. SQLite3 insertion speed

Tags: python, sqlite, pytables

I bought Kibot's stock data and it is enormous. I have about 125,000,000 rows to load (1000 stocks * 125k rows/stock [1-minute bar data since 2010-01-01], each stock in a CSV file whose fields are Date,Time,Open,High,Low,Close,Volume). I'm totally new to Python (I chose it because it's free and well supported by a community), and I chose SQLite to store the data because of Python's built-in support for it. (I also know SQL very well, and SQLiteStudio is a gem of a free program.)

My loader program works, but it is getting slower as the database grows. The SQLite db is about 6 GB and it's only halfway loaded. I'm loading about 500k rows/hour using INSERT statements, committing the transaction after each stock (approx. 125k rows).

So here's the question: is PyTables substantially faster than SQLite, enough to make learning it worth the effort? (And since I'm in learning mode, feel free to suggest alternatives to these two.) One thing that bothers me about the free version of PyTables is that it's really bare bones, almost like saving a binary file: no "where clause" functions or indexing, so you wind up scanning for the rows you need.

After I get the data loaded, I'm going to do statistical analysis (rolling regression and correlation, etc.) using something based on NumPy: Timeseries, larry, pandas, or a scikit. I haven't chosen the analysis package yet, so if you have a recommendation, and it works best with either PyTables or pandas (or whatever), please factor that into your response.

(For @John) Python 2.6;
Windows XP SP3 32-bit;
Manufactured strings used as INSERT statements;
Memory usage is rock solid at 750 MB of the 2 GB physical memory;
CPU usage is 10% +/- 5%;
Totally I/O bound (disk is always crunching).
DB schema:

create table MinuteBarPrices (
    SopDate smalldatetime not null,
    Ticker  char( 5 )     not null,
    Open    real,
    High    real,
    Low     real,
    Close   real          not null,
    Volume  int,
    primary key ( SopDate, Ticker )
);
create unique index MinuteBarPrices_IE1 on MinuteBarPrices (
    Ticker,
    SopDate
);
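For reference, the load described above (INSERT statements, one commit per stock) can be sketched with parameterized inserts instead of manufactured SQL strings. This is only a minimal sketch under stated assumptions: it targets Python 3 rather than the 2.6 mentioned above, the helper names `bar_rows` and `load_stock` and the sample rows are hypothetical, the date and time fields are simply joined as text, and the secondary index is left out for brevity. No timing is claimed for it.

```python
import csv
import io
import sqlite3

# Table definition copied from the question (secondary index omitted here).
SCHEMA = """
create table MinuteBarPrices (
    SopDate smalldatetime not null,
    Ticker  char( 5 )     not null,
    Open    real,
    High    real,
    Low     real,
    Close   real          not null,
    Volume  int,
    primary key ( SopDate, Ticker )
);
"""

INSERT_SQL = "insert into MinuteBarPrices values (?, ?, ?, ?, ?, ?, ?)"


def bar_rows(ticker, csv_file):
    """Yield one parameter tuple per CSV line (Date,Time,Open,High,Low,Close,Volume)."""
    for date, time, open_, high, low, close, volume in csv.reader(csv_file):
        # Assumption: date and time are stored joined as plain text.
        yield (date + " " + time, ticker, float(open_), float(high),
               float(low), float(close), int(volume))


def load_stock(conn, ticker, csv_file):
    """Insert all bars of one stock inside a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(INSERT_SQL, bar_rows(ticker, csv_file))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Two made-up bars standing in for one stock's CSV file.
sample = io.StringIO(
    "01/04/2010,09:30,10.0,10.5,9.9,10.2,1500\n"
    "01/04/2010,09:31,10.2,10.6,10.1,10.4,900\n"
)
load_stock(conn, "AAA", sample)
row_count = conn.execute("select count(*) from MinuteBarPrices").fetchone()[0]
```

With real files, `load_stock(conn, ticker, open(path, newline=""))` would be called once per CSV. Whether this changes the load rate on a given machine would have to be measured.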

asked May 21 '11 18:05

jdmarino

1 Answer

Back in 2003, F. Alted, the author of PyTables, wrote a scientific paper comparing PyTables and SQLite. It shows that PyTables is usually faster, but not always.
On your point that PyTables feels 'bare bones': I would say h5py is the bare-bones way of accessing HDF5 in Python. PyTables brings in all kinds of extras, such as querying and indexing, which HDF5 doesn't natively have.

Example of querying:

 # h5file is an already opened PyTables file; PyTables 3.0+ renamed readWhere to read_where.
 ham_table = h5file.root.spamfolder.hamtable
 somendarray = ham_table.readWhere('(gender == "male") & (age > 40)')

Note that PyTables Pro, which has even fancier options, has just ceased to exist as a separate product: the Pro version will be free from now on. This means even more options to play with.
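
To make the querying and indexing point concrete, here is a minimal sketch of writing minute bars to a PyTables table, indexing one column, and reading back a time range. It assumes PyTables 3.x (the `open_file`, `create_table` and `read_where` spellings) with NumPy installed. The column layout `BAR_DTYPE`, the helper `write_and_query`, the node name `bars` and the sample values are all hypothetical, not taken from the question's data.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical row layout: an integer timestamp (epoch seconds) plus two CSV fields.
BAR_DTYPE = np.dtype([
    ("ts", "i8"),
    ("close", "f8"),
    ("volume", "i4"),
])


def write_and_query(path, bars, start, stop):
    """Store bars in a new HDF5 file, index ts, and return rows with start <= ts < stop."""
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "bars", description=BAR_DTYPE)
        table.append(bars)            # bulk append of a structured array
        table.flush()
        table.cols.ts.create_index()  # index the column used in the condition
        # The condition is evaluated inside PyTables; condvars supplies the bounds.
        return table.read_where("(ts >= start) & (ts < stop)",
                                condvars={"start": start, "stop": stop})


# Three made-up bars, one minute apart.
bars = np.array(
    [(1262615400, 10.2, 1500),
     (1262615460, 10.4, 900),
     (1262615520, 10.3, 700)],
    dtype=BAR_DTYPE,
)
path = os.path.join(tempfile.mkdtemp(), "bars.h5")
hits = write_and_query(path, bars, 1262615460, 1262615580)
```

`hits` comes back as a NumPy structured array, so it can be handed directly to NumPy-based analysis code.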

answered Sep 22 '22 22:09

dirkjot
