I do a lot of statistical work and use Python as my main language. Some of the data sets I work with, though, can take 20 GB of memory, which makes operating on them using in-memory functions in numpy, scipy, and PyIMSL nearly impossible. The statistical analysis language SAS has a big advantage here in that it can operate on data from disk as opposed to strictly in-memory processing. But I want to avoid having to write a lot of code in SAS (for a variety of reasons) and am therefore trying to determine what options I have with Python (besides buying more hardware and memory).
I should clarify that approaches like map-reduce will not help in much of my work because I need to operate on complete sets of data (e.g. computing quantiles or fitting a logistic regression model).
Recently I started playing with h5py and think it is the best option I have found for allowing Python to act like SAS and operate on data from disk (via hdf5 files), while still being able to leverage numpy/scipy/matplotlib, etc. I would like to hear if anyone has experience using Python and h5py in a similar setting and what they have found. Has anyone been able to use Python in "big data" settings heretofore dominated by SAS?
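To make concrete what I mean by operating on data from disk, here is a minimal sketch of the kind of chunked pass I have in mind with h5py (the file and dataset names are made up): only one block of rows is ever in memory, and the accumulated totals give per-column mean and standard deviation.

    import numpy as np
    import h5py

    # Hypothetical file/dataset names; the data is a large 2-D float table on disk.
    CHUNK_ROWS = 100000

    with h5py.File("bigdata.h5", "r") as f:
        ds = f["measurements"]          # h5py.Dataset; nothing is loaded yet
        n_rows, n_cols = ds.shape

        # One pass over the data, CHUNK_ROWS rows at a time, accumulating
        # the pieces needed for per-column mean and standard deviation.
        count = 0
        total = np.zeros(n_cols)
        total_sq = np.zeros(n_cols)
        for start in range(0, n_rows, CHUNK_ROWS):
            block = ds[start:start + CHUNK_ROWS]   # reads only this slice from disk
            count += block.shape[0]
            total += block.sum(axis=0)
            total_sq += (block ** 2).sum(axis=0)

        mean = total / count
        std = np.sqrt(total_sq / count - mean ** 2)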
EDIT: Buying more hardware/memory certainly can help, but from an IT perspective it is hard for me to sell Python to an organization that needs to analyze huge data sets when Python (or R, or MATLAB, etc.) needs to hold the data in memory. SAS continues to have a strong selling point here: while disk-based analytics may be slower, you can confidently deal with huge data sets. So, I am hoping that Stack Overflow-ers can help me figure out how to reduce the perceived risk around using Python as a mainstay big-data analytics language.
Beyond the things listed above, there's another big advantage to a "chunked"* on-disk data format such as HDF5: reading an arbitrary slice (emphasis on arbitrary) will typically be much faster, as the on-disk data is more contiguous on average. *(HDF5 doesn't have to be a chunked data format. It supports chunking but doesn't require it; you have to ask for it when the dataset is created.)
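As a rough sketch of what that looks like in h5py (the names and sizes below are arbitrary, not from any real project): the dataset is created with an explicit chunk shape, and a later read of an arbitrary slice only has to touch the chunks that overlap it.

    import numpy as np
    import h5py

    # Hypothetical names/sizes: write a chunked, compressed dataset, then read
    # an arbitrary slice without touching the rest of the file.
    with h5py.File("chunked_demo.h5", "w") as f:
        ds = f.create_dataset(
            "x",
            shape=(100000, 20),
            dtype="f8",
            chunks=(10000, 20),      # explicit chunk shape (compression requires chunked storage)
            compression="gzip",
        )
        # Fill the dataset block by block rather than all at once.
        for start in range(0, 100000, 10000):
            ds[start:start + 10000] = np.random.rand(10000, 20)

    with h5py.File("chunked_demo.h5", "r") as f:
        # Only the chunks overlapping this slice are read and decompressed.
        middle_slice = f["x"][48000:50000, 5:10]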
We use Python in conjunction with h5py, numpy/scipy and boost::python to do data analysis. Our typical datasets have sizes of up to a few hundred GBs.
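A minimal sketch of one way datasets that large can be written with h5py (the file name, dataset name and sizes are hypothetical, not our actual pipeline): the dataset is created resizable along its first axis and grown batch by batch, so the full array never has to fit in memory.

    import numpy as np
    import h5py

    # Hypothetical sketch: grow an HDF5 dataset batch by batch.
    with h5py.File("results.h5", "w") as f:
        ds = f.create_dataset(
            "samples",
            shape=(0, 50),
            maxshape=(None, 50),    # unlimited along the first axis
            dtype="f4",
            chunks=(10000, 50),
        )
        for _ in range(100):
            batch = np.random.rand(10000, 50).astype("f4")  # stand-in for real results
            ds.resize(ds.shape[0] + batch.shape[0], axis=0)  # extend the dataset
            ds[-batch.shape[0]:] = batch                     # append the new rows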
HDF5 advantages:
HDF5 pitfalls:
I don't use Python for stats and tend to deal with relatively small datasets, but it might be worth taking a moment to check out the CRAN Task View for high-performance computing in R, especially the "Large memory and out-of-memory data" section.
Three reasons:
Again, I emphasize that this is all way out of my league, and it's certainly possible that you might already know all of this. But perhaps this will prove useful to you or someone working on the same problems.