I am working with an Oracle database with millions of rows and 100+ columns. I am attempting to store this data in an HDF5 file using pytables with certain columns indexed. I will be reading subsets of these data in a pandas DataFrame and performing computations.
I have attempted the following:
Downloaded the table into a CSV file using a utility, read the CSV file chunk by chunk using pandas, and appended each chunk to an HDF5 table using pandas.HDFStore. I created a dtype definition and provided the maximum string sizes.
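A minimal sketch of that CSV-to-HDF5 step (file name, column names, and dtypes below are hypothetical placeholders):

```python
import pandas as pd

csv_path = "oracle_dump.csv"                              # hypothetical dump file
my_dtype = {"a": "float64", "b": "int64", "c": "object"}  # hypothetical schema

with pd.HDFStore("data.h5", mode="w", complib="blosc", complevel=9) as store:
    for chunk in pd.read_csv(csv_path, dtype=my_dtype,
                             parse_dates=["trade_date"],  # hypothetical date column
                             chunksize=500_000):
        # min_itemsize pins the maximum string width so later chunks cannot overflow it
        store.append("mytable", chunk,
                     data_columns=["trade_date"],         # columns to query on later
                     min_itemsize={"c": 64})
```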
However, when I now try to download the data directly from the Oracle DB and write it to the HDF5 file via pandas.HDFStore, I run into some problems:
pandas.io.sql.read_frame does not support chunked reading, and I don't have enough RAM to pull the entire result set into memory first.
If I try to use cursor.fetchmany() with a fixed number of records, the read operation takes ages because the DB table is not indexed and I have to read records falling within a date range. I am using DataFrame(cursor.fetchmany(), columns=['a','b','c'], dtype=my_dtype); however, the created DataFrame always infers the dtypes rather than enforcing the dtype I have provided (unlike read_csv, which adheres to the dtype I provide). Hence, when I append this DataFrame to an already existing HDFStore, there is a type mismatch: e.g. a float64 may be interpreted as int64 in one chunk.
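For concreteness, the chunked loop looks roughly like this (a sketch with hypothetical column names and an assumed cx_Oracle-style cursor; since the DataFrame constructor only accepts a single scalar dtype, one workaround is to cast each chunk with .astype() before appending so every chunk lands in the store with the same types):

```python
import pandas as pd

columns = ["a", "b", "c"]                                  # hypothetical columns
my_dtype = {"a": "float64", "b": "int64", "c": "object"}   # hypothetical dtype map

cursor.execute(
    "SELECT a, b, c FROM my_table WHERE trade_date BETWEEN :1 AND :2",
    (start_date, end_date),                                # assumed bind values
)

with pd.HDFStore("data.h5", mode="a") as store:
    while True:
        rows = cursor.fetchmany(100_000)
        if not rows:
            break
        # .astype() enforces the per-column dtypes that the constructor cannot
        chunk = pd.DataFrame.from_records(rows, columns=columns).astype(my_dtype)
        store.append("mytable", chunk, data_columns=True, min_itemsize={"c": 64})
```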
I'd appreciate it if you could offer your thoughts and point me in the right direction.
Meet Vaex. Vaex is a high-performance Python library for lazy, out-of-core DataFrames (similar to pandas) that lets you visualize and explore big tabular datasets. It can calculate basic statistics for more than a billion rows per second, and it supports multiple visualizations for interactive exploration of big data.
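A minimal sketch of how that could look here, assuming the CSV dump from the first step and hypothetical column names (convert=True is vaex's chunked CSV-to-HDF5 conversion, after which operations are lazy and memory-mapped):

```python
import vaex

# Reads the CSV in chunks and caches it as a vaex-native HDF5 file on first use
df = vaex.from_csv("oracle_dump.csv", convert=True, chunk_size=1_000_000)

print(df.count(), df.mean(df.a))        # out-of-core statistics
subset = df[df.a > 0].to_pandas_df()    # materialize only the rows you need
```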
Typically, pandas finds its sweet spot with low- to medium-sized datasets of up to a few million rows. Beyond this, more distributed frameworks such as Spark or Dask are usually preferred. It is, however, possible to scale pandas well beyond that point.
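For the Dask route, a minimal sketch (file pattern and column names are hypothetical); Dask splits the data into partitions and only needs one partition in memory at a time:

```python
import dask.dataframe as dd

ddf = dd.read_csv("oracle_dump_*.csv",
                  dtype={"a": "float64", "b": "int64"},
                  parse_dates=["trade_date"])

# Lazy task graph; nothing is read until .compute()
result = ddf[ddf.trade_date >= "2020-01-01"].groupby("b").a.mean().compute()
```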
The answer is YES. You can handle large datasets in Python using pandas with some techniques, for example chunked reading with explicit dtypes (one such technique is sketched below).
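One of those techniques, sketched under assumptions (hypothetical connection details, table, and columns): newer pandas versions let read_sql stream results with a chunksize argument, so the full result set never has to fit in RAM:

```python
import datetime as dt

import cx_Oracle   # or the newer python-oracledb driver
import pandas as pd

conn = cx_Oracle.connect("user", "password", "host:1521/service")  # hypothetical DSN
query = "SELECT a, b, c FROM my_table WHERE trade_date BETWEEN :start_dt AND :end_dt"
binds = {"start_dt": dt.date(2020, 1, 1), "end_dt": dt.date(2020, 2, 1)}

with pd.HDFStore("data.h5", mode="a") as store:
    for chunk in pd.read_sql(query, conn, params=binds, chunksize=100_000):
        chunk = chunk.rename(columns=str.lower)   # Oracle reports column names in uppercase
        # cast per chunk so every append matches the dtypes already in the store
        store.append("mytable", chunk.astype({"a": "float64", "b": "int64"}),
                     data_columns=True)
```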
Well, the only practical solution for now is to use PyTables directly since it's designed for out-of-memory operation... It's a bit tedious but not that bad:
http://www.pytables.org/moin/HintsForSQLUsers#Insertingdata
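If you go the direct PyTables route, a minimal sketch with hypothetical column names (the linked page covers the details; in practice the rows would come from cursor.fetchmany()):

```python
import tables as tb

class Record(tb.IsDescription):
    trade_date = tb.Int64Col(pos=0)   # date stored as epoch seconds, one simple option
    a = tb.Float64Col(pos=1)
    c = tb.StringCol(64, pos=2)       # fixed maximum string width, as in the CSV pipeline

with tb.open_file("data_pt.h5", mode="w") as h5:
    table = h5.create_table("/", "mytable", Record, "Oracle extract")
    table.append([(1609459200, 1.5, b"foo"),   # placeholder rows
                  (1609545600, 2.5, b"bar")])
    table.flush()
    table.cols.trade_date.create_index()       # index the column you filter on
```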
Another approach, using Pandas, is here:
"Large data" work flows using pandas