I'm playing with some GitHub user data and was trying to create a graph of all people in the same city. To do this I need to use the merge operation in dask. Unfortunately the GitHub user base is 6M users, and it seems that the merge operation is causing the resulting dataframe to blow up. I used the following code:
import dask.dataframe as dd

# Read the same id/city columns twice so the frame can be self-joined on city
gh = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()
st = dd.read_hdf('data/github.hd5', '/github', chunksize=5000, columns=['id', 'city']).dropna()

# Self-join: pairs every user with every other user in the same city
mrg = gh.merge(st, on='city').drop('city', axis=1)
mrg['max'] = mrg.max(axis=1)
mrg['min'] = mrg.min(axis=1)
mrg.to_castra('github')
I can merge on other criteria such as name/username using this code, but I get a MemoryError when I try to run the code above.
I have tried running this with the synchronous, multiprocessing, and threaded schedulers.
I'm trying to do this on a Dell laptop with a 4-core i7 and 8GB RAM. Shouldn't dask do this operation in a chunked manner, or am I getting this wrong? Is writing the code using pandas DataFrame iterators the only way out?
Dask DataFrame merge: you can join a Dask DataFrame to a small pandas DataFrame using the dask.dataframe.merge() method, similar to the pandas API. To join two large Dask DataFrames, you use the exact same Python syntax.
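To make the small-pandas case concrete, here is a minimal sketch; the lookup table and its country column are made up for illustration. Because the right side is a plain pandas DataFrame, Dask applies the merge partition by partition, with no shuffle of the large frame:

import pandas as pd
import dask.dataframe as dd

# Hypothetical small lookup table held entirely in memory
small = pd.DataFrame({'city': ['Austin', 'Berlin'], 'country': ['US', 'DE']})

big = dd.read_hdf('data/github.hd5', '/github', chunksize=5000)

# Merging against a pandas DataFrame runs per partition, no shuffle needed
joined = big.merge(small, on='city', how='left')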
Using dask instead of pandas to merge large data sets: the Python package dask lets you run data analytics in parallel and out of core, which means it can be faster and more memory efficient than pandas for data that does not fit in RAM.
TL;DR: unmanaged memory is RAM that the Dask scheduler is not directly aware of; too much of it can make workers run out of memory and cause computations to hang or crash.
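One way to observe this on your own machine is to start a local cluster with explicit per-worker memory limits and watch the dashboard; the worker count and limit below are arbitrary example values, not a recommendation:

from dask.distributed import Client

# Four local workers, each capped at 2GB; the dashboard breaks memory usage
# down per worker, including the unmanaged portion
client = Client(n_workers=4, memory_limit='2GB')
print(client.dashboard_link)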
Lazy evaluation: most Dask collections, including Dask DataFrame, are evaluated lazily, which means Dask constructs the logic of your computation (called a task graph) immediately but evaluates it only when necessary.
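A small sketch of what laziness means in practice, reusing the file from the question:

import dask.dataframe as dd

df = dd.read_hdf('data/github.hd5', '/github', chunksize=5000)
n = df['id'].count()  # only builds the task graph; nothing is read yet
print(n.compute())    # compute() triggers the actual chunked work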
Castra isn't supported anymore, so using HDF is recommended. From the comments, writing to multiple files with to_hdf() solved the memory error:
mrg.to_hdf('github-*.hdf', '/data')
The * in the file name is replaced by the partition index, so each partition is written to its own file instead of being collected into one giant result. Note that to_hdf() also takes an HDF key as its second argument; the '/data' key here is an arbitrary choice.
Relevant documentation: https://docs.dask.org/en/latest/generated/dask.dataframe.DataFrame.to_hdf.html
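To load the result back, dd.read_hdf() accepts the same glob pattern; a minimal sketch, assuming the '/data' key used above:

import dask.dataframe as dd

mrg2 = dd.read_hdf('github-*.hdf', '/data')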