 

Reading a large table with millions of rows from Oracle and writing to HDF5

I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using PyTables, with certain columns indexed. I will be reading subsets of this data into a pandas DataFrame and performing computations.

I have attempted the following:

Download the table to a CSV file using a utility, read the CSV file chunk by chunk using pandas, and append it to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
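For reference, a minimal sketch of that CSV-chunking workflow, assuming made-up column names, dtypes and string sizes (`export.csv`, columns `a`, `b`, `c`, `date` are placeholders):

```python
import numpy as np
import pandas as pd

# Placeholder schema: column names, dtypes and string sizes are assumptions.
csv_dtype = {'a': np.float64, 'b': np.int64, 'c': str}
min_itemsize = {'c': 64}  # maximum expected string length for column 'c'

with pd.HDFStore('data.h5', mode='w', complib='blosc') as store:
    # Read the exported CSV in chunks and append each one to a single table,
    # making the columns we will later query on data_columns (queryable).
    for chunk in pd.read_csv('export.csv', dtype=csv_dtype,
                             parse_dates=['date'], chunksize=500000):
        store.append('big_table', chunk,
                     data_columns=['date', 'a'],
                     min_itemsize=min_itemsize)
```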

However, now when I try to download data directly from the Oracle DB and write it to the HDF5 file via pandas.HDFStore, I run into some problems.

pandas.io.sql.read_frame does not support chunked reading, and I don't have enough RAM to download the entire dataset into memory first.

If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns=['a','b','c'], dtype=my_dtype); however, the created DataFrame always infers the dtype rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch: e.g. a float64 may be interpreted as int64 in one chunk.
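One possible workaround (not from the original post) is to build each chunk with DataFrame.from_records and then force the dtypes explicitly with astype, so every chunk matches the schema of the existing HDFStore table regardless of what pandas infers. The connection string, query, column names and dtypes below are placeholders:

```python
import datetime

import cx_Oracle
import numpy as np
import pandas as pd

# Placeholder connection string, query and schema.
conn = cx_Oracle.connect('user/password@host:1521/service')
cursor = conn.cursor()
cursor.execute(
    "SELECT a, b, c FROM big_table WHERE trade_date BETWEEN :1 AND :2",
    (datetime.date(2013, 1, 1), datetime.date(2013, 2, 1)))

columns = ['a', 'b', 'c']
forced_dtype = {'a': np.float64, 'b': np.int64, 'c': object}

with pd.HDFStore('data.h5', mode='a', complib='blosc') as store:
    while True:
        rows = cursor.fetchmany(100000)
        if not rows:
            break
        # Cast explicitly so the chunk's dtypes never depend on what pandas
        # happens to infer from this particular batch of rows.
        chunk = pd.DataFrame.from_records(rows, columns=columns)
        chunk = chunk.astype(forced_dtype)
        store.append('big_table', chunk,
                     data_columns=['a'], min_itemsize={'c': 64})
```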

I'd appreciate it if you could offer your thoughts and point me in the right direction.

asked Dec 16 '13 by smartexpert

People also ask

How do you process a DataFrame with billions of rows in seconds?

Meet Vaex. Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second, and it supports multiple visualizations that allow interactive exploration of big data.
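For illustration only (the file and column names are made up), a minimal Vaex sketch; note that Vaex expects its own column-oriented HDF5 layout rather than the PyTables format that pandas.HDFStore writes:

```python
import vaex

# Lazily open a (Vaex-format) HDF5 file; nothing is loaded into RAM up front.
df = vaex.open('big_table.hdf5')

# Statistics are computed out of core by streaming over the data.
print(df['a'].mean())
print(df.count())
```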

Can Pandas handle billions of rows?

Typically, Pandas finds its sweet spot in low- to medium-sized datasets of up to a few million rows. Beyond this, more distributed frameworks such as Spark or Dask are usually preferred. It is, however, possible to scale pandas well beyond this point.
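As a rough illustration of the Dask approach (the file pattern and column names are placeholders):

```python
import dask.dataframe as dd

# Lazily reference many CSV files without reading them into memory.
df = dd.read_csv('export-*.csv')

# Work is split into chunks and only executed when .compute() is called.
result = df.groupby('b')['a'].mean().compute()
print(result)
```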

Can python handle millions of records?

The answer is yes. You can handle large datasets in Python using Pandas with some techniques.


1 Answer

Well, the only practical solution for now is to use PyTables directly since it's designed for out-of-memory operation... It's a bit tedious but not that bad:

http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
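Roughly what that looks like, as a sketch with a hypothetical table description, query and connection string (not taken from the linked page):

```python
import cx_Oracle
import tables

# Hypothetical row description; adjust names, types and string sizes.
class Record(tables.IsDescription):
    a = tables.Float64Col()
    b = tables.Int64Col()
    c = tables.StringCol(64)

h5 = tables.open_file('data.h5', mode='w')
table = h5.create_table('/', 'big_table', Record,
                        filters=tables.Filters(complevel=5, complib='blosc'))

conn = cx_Oracle.connect('user/password@host:1521/service')
cursor = conn.cursor()
cursor.execute("SELECT a, b, c FROM big_table")

row = table.row
while True:
    rows = cursor.fetchmany(100000)
    if not rows:
        break
    for a, b, c in rows:
        row['a'] = a
        row['b'] = b
        row['c'] = c  # StringCol stores fixed-width bytes
        row.append()
    table.flush()

# Index the columns that will be queried later.
table.cols.a.create_index()
h5.close()
```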

Another approach, using Pandas, is here:

"Large data" work flows using pandas

answered Sep 30 '22 by LetMeSOThat4U