
Choosing a framework for larger-than-memory data analysis with Python

I'm solving a problem with a dataset that is larger than memory. The original dataset is a .csv file, and one of its columns contains track IDs from the musicbrainz service.

What I already did

I read the .csv file with dask and converted it to castra format on disk for higher performance. I also queried the musicbrainz API and populated an sqlite DB, using peewee, with some relevant results. I chose to use a DB instead of another dask.dataframe because the process took a few days and I didn't want to lose data in case of any failure.
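For concreteness, here is a rough sketch of that setup. The file names, the Track model, and its columns are placeholders of mine, not the real schema, and the castra conversion is only noted in a comment because dask later removed that helper:

```python
import dask.dataframe as dd
from peewee import SqliteDatabase, Model, CharField, IntegerField

# Lazily read the larger-than-memory CSV with dask
df = dd.read_csv('listens.csv')
# The conversion to castra used dask's (since removed) helper, roughly:
# df.to_castra('listens.castra')

# Persist the MusicBrainz API results in sqlite via peewee, so a crash
# partway through the multi-day crawl doesn't lose anything
db = SqliteDatabase('musicbrainz.db')

class Track(Model):
    track_id = CharField(primary_key=True)   # MusicBrainz recording id
    artist = CharField(null=True)
    length = IntegerField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Track])
# ... for each API response: Track.create(track_id=..., artist=..., length=...)
```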

I haven't really started to analyze the data yet. I managed to make enough of a mess just rearranging it.

The current problem

I'm having a hard time joining the columns from the SQL DB onto the dask / castra dataframe. Actually, I'm not sure whether this is viable at all.
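If the sqlite side is small enough to fit in memory (track metadata usually is), one pattern I could try is pulling it into a plain pandas frame and merging that against the dask dataframe, since dask accepts a pandas object on the other side of the join. A minimal sketch, with hypothetical table and column names:

```python
import sqlite3
import pandas as pd
import dask.dataframe as dd

# Pull the (small) MusicBrainz table fully into memory
con = sqlite3.connect('musicbrainz.db')
tracks = pd.read_sql_query('SELECT * FROM track', con)

# The big side stays out-of-core in dask
df = dd.read_csv('listens.csv')

# dask.dataframe can merge against an in-memory pandas frame
joined = dd.merge(df, tracks, on='track_id', how='left')
```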

Alternative approaches

It seems that I made some mistakes in choosing the best tools for the task. Castra is probably not mature enough, and I think that's part of the problem. In addition, it may be better to choose SQLAlchemy over peewee, as it is used by pandas and peewee isn't.

Blaze + HDF5 might serve as a good alternative to dask + castra, mainly because HDF5 is more stable / mature / complete than castra, and blaze is less opinionated about data storage. For example, it may simplify joining the SQL DB into the main dataset.
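As far as dask itself goes, HDF5 is already supported directly, so switching the on-disk format wouldn't force a switch away from the dask API. A sketch, where the path and key are made up and to_hdf needs PyTables installed:

```python
import dask.dataframe as dd

# Write the CSV-derived frame out to HDF5 once...
df = dd.read_csv('listens.csv')
df.to_hdf('listens.h5', '/listens')

# ...and read it back lazily in later sessions
df = dd.read_hdf('listens.h5', '/listens')
```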

On the other hand, I'm familiar with pandas, and dask exposes the "same" API. With dask I also gain parallelism.

TL;DR

I have a larger-than-memory dataset plus an sqlite DB that I need to join into the main dataset. I'm in doubt whether to keep working with dask + castra (I don't know of other relevant data stores for dask.dataframe) and use SQLAlchemy to load parts of the SQL DB at a time into the dataframe with pandas, or to switch to blaze + HDF5 instead. What would you suggest in this case?
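The "load parts of the SQL DB at a time" option would look roughly like this with SQLAlchemy + pandas (the query and chunk size are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///musicbrainz.db')

# Stream the table in manageable pieces instead of one big read
for chunk in pd.read_sql_query('SELECT * FROM track', engine, chunksize=50000):
    pass  # e.g. merge each chunk against the matching rows of the main dataset
```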

Any other option / opinion is welcome. I hope that this is specific enough for SO.

Asked by Nagasaki45 on Oct 14 '15



1 Answer

You're correct in the following points:

  • Castra is experimental and immature.

If you want something more mature you could consider HDF5 or CSV (if you're fine with slow performance). Dask.dataframe supports these formats in the same way that pandas does.

  • It is not clear how to join between two different formats like dask.dataframe and SQL.

You probably want to use one or the other. If you're interested in reading SQL data into dask.dataframe you could raise an issue; this would not be hard to add for common cases.
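In the meantime, a hand-rolled version of that is possible by wrapping chunked pandas reads in dask.delayed and stitching them into a dask.dataframe. A sketch, with the table name and the rowid range as assumptions:

```python
import pandas as pd
import dask
import dask.dataframe as dd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///musicbrainz.db')

@dask.delayed
def load_range(lo, hi):
    # For multi-process schedulers, create the engine inside this function instead
    q = 'SELECT * FROM track WHERE rowid >= %d AND rowid < %d' % (lo, hi)
    return pd.read_sql_query(q, engine)

step = 100000
parts = [load_range(lo, lo + step) for lo in range(0, 1000000, step)]

# from_delayed wants a small example frame (meta) to learn the column dtypes
meta = pd.read_sql_query('SELECT * FROM track LIMIT 5', engine)
sql_df = dd.from_delayed(parts, meta=meta)
```

For what it's worth, later dask releases added dask.dataframe.read_sql_table, which packages roughly this pattern.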

Answered by MRocklin