
Choosing a framework for larger-than-memory data analysis with Python

I'm solving a problem with a dataset that is larger than memory. The original dataset is a .csv file, and one of its columns contains track IDs from the musicbrainz service.

What I already did

I read the .csv file with dask and converted it to castra format on disk for higher performance. I also queried the musicbrainz API and populated an sqlite DB, using peewee, with some relevant results. I chose to use a DB instead of another dask.dataframe because the process took a few days and I didn't want to lose data in case of any failure.
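For concreteness, here is a rough sketch of that setup. The file names, the Track model, and its columns are placeholders of mine, not the real schema, and the castra conversion is only noted in a comment because dask later removed that helper:

```python
import dask.dataframe as dd
from peewee import SqliteDatabase, Model, CharField, IntegerField

# Lazily read the larger-than-memory CSV with dask
df = dd.read_csv('listens.csv')
# The conversion to castra used dask's (since removed) helper, roughly:
# df.to_castra('listens.castra')

# Persist the MusicBrainz API results in sqlite via peewee, so a crash
# partway through the multi-day crawl doesn't lose anything
db = SqliteDatabase('musicbrainz.db')

class Track(Model):
    track_id = CharField(primary_key=True)   # MusicBrainz recording id
    artist = CharField(null=True)
    length = IntegerField(null=True)

    class Meta:
        database = db

db.connect()
db.create_tables([Track])
# ... for each API response: Track.create(track_id=..., artist=..., length=...)
```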

I haven't really started to analyze the data yet. I managed to make enough of a mess just rearranging it.

The current problem

I'm having a hard time joining the columns from the SQL DB onto the dask / castra dataframe. Actually, I'm not sure whether this is viable at all.
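If the sqlite side is small enough to fit in memory (track metadata usually is), one pattern I could try is pulling it into a plain pandas frame and merging that against the dask dataframe, since dask accepts a pandas object on the other side of the join. A minimal sketch, with hypothetical table and column names:

```python
import sqlite3
import pandas as pd
import dask.dataframe as dd

# Pull the (small) MusicBrainz table fully into memory
con = sqlite3.connect('musicbrainz.db')
tracks = pd.read_sql_query('SELECT * FROM track', con)

# The big side stays out-of-core in dask
df = dd.read_csv('listens.csv')

# dask.dataframe can merge against an in-memory pandas frame
joined = dd.merge(df, tracks, on='track_id', how='left')
```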

Alternative approaches

It seems that I made some mistakes in choosing the best tools for the task. Castra is probably not mature enough, and I think that's part of the problem. In addition, it may be better to choose SQLAlchemy over peewee, as it is used by pandas and peewee isn't.

Blaze + HDF5 might serve as a good alternative to dask + castra, mainly because HDF5 is more stable / mature / complete than castra, and blaze is less opinionated about data storage. For example, it may simplify joining the SQL DB into the main dataset.
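As far as dask itself goes, HDF5 is already supported directly, so switching the on-disk format wouldn't force a switch away from the dask API. A sketch, where the path and key are made up and to_hdf needs PyTables installed:

```python
import dask.dataframe as dd

# Write the CSV-derived frame out to HDF5 once...
df = dd.read_csv('listens.csv')
df.to_hdf('listens.h5', '/listens')

# ...and read it back lazily in later sessions
df = dd.read_hdf('listens.h5', '/listens')
```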

On the other hand, I'm familiar with pandas, and dask exposes the "same" API. With dask I also gain parallelism.

TL;DR

I have a larger-than-memory dataset plus an sqlite DB that I need to join into the main dataset. I'm in doubt whether to keep working with dask + castra (I don't know of other relevant data stores for dask.dataframe) and use SQLAlchemy to load parts of the SQL DB at a time into the dataframe with pandas, or to switch to blaze + HDF5 instead. What would you suggest in this case?
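The "load parts of the SQL DB at a time" option would look roughly like this with SQLAlchemy + pandas (the query and chunk size are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///musicbrainz.db')

# Stream the table in manageable pieces instead of one big read
for chunk in pd.read_sql_query('SELECT * FROM track', engine, chunksize=50000):
    pass  # e.g. merge each chunk against the matching rows of the main dataset
```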

Any other option / opinion is welcome. I hope that this is specific enough for SO.

Asked by Nagasaki45 on Oct 14 '15



1 Answer

You're correct in the following points:

  • Castra is experimental and immature.

If you want something more mature you could consider HDF5 or CSV (if you're fine with slow performance). Dask.dataframe supports these formats in the same way that pandas does.

  • It is not clear how to join between two different formats like dask.dataframe and SQL.

You probably want to use one or the other. If you're interested in reading SQL data into dask.dataframe you could raise an issue; this would not be hard to add for common cases.
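In the meantime, a hand-rolled version of that is possible by wrapping chunked pandas reads in dask.delayed and stitching them into a dask.dataframe. A sketch, with the table name and the rowid range as assumptions:

```python
import pandas as pd
import dask
import dask.dataframe as dd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///musicbrainz.db')

@dask.delayed
def load_range(lo, hi):
    # For multi-process schedulers, create the engine inside this function instead
    q = 'SELECT * FROM track WHERE rowid >= %d AND rowid < %d' % (lo, hi)
    return pd.read_sql_query(q, engine)

step = 100000
parts = [load_range(lo, lo + step) for lo in range(0, 1000000, step)]

# from_delayed wants a small example frame (meta) to learn the column dtypes
meta = pd.read_sql_query('SELECT * FROM track LIMIT 5', engine)
sql_df = dd.from_delayed(parts, meta=meta)
```

For what it's worth, later dask releases added dask.dataframe.read_sql_table, which packages roughly this pattern.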

Answered by MRocklin