I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.
I have attempted the following:
Downloaded the table into a CSV file using a utility, read the CSV file chunk by chunk using pandas, and appended each chunk to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
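A minimal sketch of that CSV-to-HDF5 step (file name, column names, and dtypes below are hypothetical placeholders):

```python
import pandas as pd

csv_path = "oracle_dump.csv"                              # hypothetical dump file
my_dtype = {"a": "float64", "b": "int64", "c": "object"}  # hypothetical schema

with pd.HDFStore("data.h5", mode="w", complib="blosc", complevel=9) as store:
    for chunk in pd.read_csv(csv_path, dtype=my_dtype,
                             parse_dates=["trade_date"],  # hypothetical date column
                             chunksize=500_000):
        # min_itemsize pins the maximum string width so later chunks cannot overflow it
        store.append("mytable", chunk,
                     data_columns=["trade_date"],         # columns to query on later
                     min_itemsize={"c": 64})
```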
However, when I now try to download the data directly from the Oracle DB and write it to the HDF5 file via pandas.HDFStore, I run into some problems:
pandas.io.sql.read_frame does not support chunked reading, and I don't have enough RAM to pull the entire result set into memory first.
If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns=['a','b','c'], dtype=my_dtype); however, the created DataFrame always infers the dtypes rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch: e.g. a float64 may be interpreted as int64 in one chunk.
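For concreteness, the chunked loop looks roughly like this (a sketch with hypothetical column names and an assumed cx_Oracle-style cursor; since the DataFrame constructor only accepts a single scalar dtype, one workaround is to cast each chunk with .astype() before appending so every chunk lands in the store with the same types):

```python
import pandas as pd

columns = ["a", "b", "c"]                                  # hypothetical columns
my_dtype = {"a": "float64", "b": "int64", "c": "object"}   # hypothetical dtype map

cursor.execute(
    "SELECT a, b, c FROM my_table WHERE trade_date BETWEEN :1 AND :2",
    (start_date, end_date),                                # assumed bind values
)

with pd.HDFStore("data.h5", mode="a") as store:
    while True:
        rows = cursor.fetchmany(100_000)
        if not rows:
            break
        # .astype() enforces the per-column dtypes that the constructor cannot
        chunk = pd.DataFrame.from_records(rows, columns=columns).astype(my_dtype)
        store.append("mytable", chunk, data_columns=True, min_itemsize={"c": 64})
```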
I'd appreciate it if you could offer your thoughts and point me in the right direction.
Meet Vaex. Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas) that lets you visualize and explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second, and it supports multiple visualizations for interactive exploration of big data.
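A minimal sketch of how that could look here, assuming the CSV dump from the first step and hypothetical column names (convert=True is vaex's chunked CSV-to-HDF5 conversion, after which operations are lazy and memory-mapped):

```python
import vaex

# Reads the CSV in chunks and caches it as a vaex-native HDF5 file on first use
df = vaex.from_csv("oracle_dump.csv", convert=True, chunk_size=1_000_000)

print(df.count(), df.mean(df.a))        # out-of-core statistics
subset = df[df.a > 0].to_pandas_df()    # materialize only the rows you need
```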
Typically, pandas finds its sweet spot with low- to medium-sized datasets of up to a few million rows. Beyond this, more distributed frameworks such as Spark or Dask are usually preferred. It is, however, possible to scale pandas well beyond that point.
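For the Dask route, a minimal sketch (file pattern and column names are hypothetical); Dask splits the data into partitions and only needs one partition in memory at a time:

```python
import dask.dataframe as dd

ddf = dd.read_csv("oracle_dump_*.csv",
                  dtype={"a": "float64", "b": "int64"},
                  parse_dates=["trade_date"])

# Lazy task graph; nothing is read until .compute()
result = ddf[ddf.trade_date >= "2020-01-01"].groupby("b").a.mean().compute()
```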
The answer is YES. You can handle large datasets in Python using pandas with some techniques, for example chunked reading with explicit dtypes (one such technique is sketched below).
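One of those techniques, sketched under assumptions (hypothetical connection details, table, and columns): newer pandas versions let read_sql stream results with a chunksize argument, so the full result set never has to fit in RAM:

```python
import datetime as dt

import cx_Oracle   # or the newer python-oracledb driver
import pandas as pd

conn = cx_Oracle.connect("user", "password", "host:1521/service")  # hypothetical DSN
query = "SELECT a, b, c FROM my_table WHERE trade_date BETWEEN :start_dt AND :end_dt"
binds = {"start_dt": dt.date(2020, 1, 1), "end_dt": dt.date(2020, 2, 1)}

with pd.HDFStore("data.h5", mode="a") as store:
    for chunk in pd.read_sql(query, conn, params=binds, chunksize=100_000):
        chunk = chunk.rename(columns=str.lower)   # Oracle reports column names in uppercase
        # cast per chunk so every append matches the dtypes already in the store
        store.append("mytable", chunk.astype({"a": "float64", "b": "int64"}),
                     data_columns=True)
```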
Well, the only practical solution for now is to use PyTables directly since it's designed for out-of-memory operation... It's a bit tedious but not that bad:
http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
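If you go the direct PyTables route, a minimal sketch with hypothetical column names (the linked page covers the details; in practice the rows would come from cursor.fetchmany()):

```python
import tables as tb

class Record(tb.IsDescription):
    trade_date = tb.Int64Col(pos=0)   # date stored as epoch seconds, one simple option
    a = tb.Float64Col(pos=1)
    c = tb.StringCol(64, pos=2)       # fixed maximum string width, as in the CSV pipeline

with tb.open_file("data_pt.h5", mode="w") as h5:
    table = h5.create_table("/", "mytable", Record, "Oracle extract")
    table.append([(1609459200, 1.5, b"foo"),   # placeholder rows
                  (1609545600, 2.5, b"bar")])
    table.flush()
    table.cols.trade_date.create_index()       # index the column you filter on
```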
Another approach, using Pandas, is here:
"Large data" work flows using pandas