I am trying to turn the Wikipedia CirrusSearch dump into a Parquet-backed Dask DataFrame indexed by title, on a 450 GB, 16-core GCP instance. CirrusSearch dumps come as a single JSON-lines-formatted file. The English Wikipedia dump contains about 5M records and is 12 GB compressed and 90+ GB expanded. An important detail is that the records are not completely flat.
The simplest way to do this would be:
import json
import dask
from dask import bag as db, dataframe as ddf
from toolz import curried as tz
from toolz.curried import operator as op

blocksize = 2**24                 # 16 MiB of text per bag partition
npartitions = 'auto'
parquetopts = dict(engine='fastparquet', object_encoding='json')

lang = 'en'
wiki = 'wiki'
date = 20180625
path = './'
source = f'{path}{lang}{wiki}-{date}-cirrussearch-content.json'

(
    db
    .read_text(source, blocksize=blocksize)     # one JSON document per line
    .map(json.loads)
    .filter(tz.flip(op.contains, 'title'))      # keep only records that have a 'title' key
    .to_dataframe()
    .set_index('title', npartitions=npartitions)
    .to_parquet(f'{lang}{wiki}-{date}-cirrussearch.pq', **parquetopts)
)
The first problem is that with the default scheduler this utilizes only one core. That problem can be avoided by explicitly using either the distributed or the multiprocessing scheduler.
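For example (a minimal sketch, with illustrative worker and thread counts rather than anything from the original setup), the distributed scheduler can be started locally with one single-threaded worker process per core:

# Run the distributed scheduler on this machine, one process per core
from dask.distributed import Client
client = Client(n_workers=16, threads_per_worker=1)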
The bigger problem, with all schedulers and settings I have tried, is memory usage. It appears that Dask tries to load the entire dataframe into memory when indexing. Even 450 GB is not enough RAM for this.
The JSON parsing part of this is probably GIL-bound, so you want to use processes. However, when you finally compute something you're using dataframes, which generally assume that computations release the GIL (this is common in Pandas), so Dask uses the threaded scheduler by default. If you are mostly bound by the GIL-heavy parsing stage, then you probably want to use the multiprocessing scheduler. This should solve your problem:
dask.config.set(scheduler='multiprocessing')
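As a usage note (a sketch that reuses source, blocksize, and the other names defined in the question's snippet), the same setting can also be scoped to a single computation with a context manager instead of being applied globally:

import dask

# Only work triggered inside this block uses the multiprocessing scheduler
with dask.config.set(scheduler='multiprocessing'):
    (
        db
        .read_text(source, blocksize=blocksize)
        .map(json.loads)
        .filter(tz.flip(op.contains, 'title'))
        .to_dataframe()
        .set_index('title', npartitions=npartitions)
        .to_parquet(f'{lang}{wiki}-{date}-cirrussearch.pq', **parquetopts)
    )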
Yeah, the set_index computation requires the full dataset. This is a hard problem. If you're using the single-machine scheduler (which you appear to be doing), then it should be using an out-of-core data structure for this sorting process; I'm surprised that it's running out of memory.
Unfortunately, it's difficult to estimate the in-memory size of JSON-like data in any language. This is much easier with flat schemas.
This doesn't solve your core issue, but you might consider staging the data in Parquet format before trying to sort everything, and then running dd.read_parquet(...).set_index(...).to_parquet(...) in isolation. This might help to isolate some of the costs.
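A minimal sketch of that staged approach, reusing the names from the question's snippet; the staging path below is just an illustrative choice:

# Stage 1: parse the dump and write unsorted Parquet (no set_index),
# so no global shuffle of the 90+ GB dataset is needed yet.
staged = f'{lang}{wiki}-{date}-cirrussearch-staged.pq'   # illustrative staging path
(
    db
    .read_text(source, blocksize=blocksize)
    .map(json.loads)
    .filter(tz.flip(op.contains, 'title'))
    .to_dataframe()
    .to_parquet(staged, **parquetopts)
)

# Stage 2: read the staged Parquet back and do the sort/index step in isolation.
import dask.dataframe as dd
(
    dd
    .read_parquet(staged, engine='fastparquet')
    .set_index('title')
    .to_parquet(f'{lang}{wiki}-{date}-cirrussearch.pq', **parquetopts)
)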