I have a directory of json files that I am trying to convert to a dask DataFrame and save to castra. There are 200 files containing O(10**7) json records between them. The code is very simple, largely following tutorial examples.
import dask.dataframe as dd
import dask.bag as db
import json
txt = db.from_filenames('part-*.json')
js = txt.map(json.loads)
df = js.to_dataframe()
cs = df.to_castra("data.castra")
I am running it on a 32-core machine, but the code only utilizes one core at 100%. My understanding from the docs is that this code should execute in parallel. Why is it not? Did I misunderstand something?
Your final collection is a dask dataframe, which uses threads by default. Because parsing JSON in Python holds the GIL, the threaded scheduler can't use more than one core for this workload, so you have to explicitly tell dask to use processes.
You can do this globally:
import dask
dask.config.set(scheduler='multiprocessing')
Or do this just on the to_castra call:
df.to_castra("data.castra", scheduler='multiprocessing')
Also, just as a warning, Castra was mostly an experiment. It's decently fast but not nearly as mature as something like HDF5 or Parquet.
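If you can, writing the same dataframe to Parquet is the more robust route. Here is a hedged sketch of the whole pipeline, assuming a newer dask (where from_filenames became read_text) and a Parquet engine such as pyarrow or fastparquet installed:

import json
import dask
import dask.bag as db

dask.config.set(scheduler='multiprocessing')  # processes, since JSON parsing holds the GIL

txt = db.read_text('part-*.json')        # newer-dask replacement for from_filenames
df = txt.map(json.loads).to_dataframe()
df.to_parquet('data.parquet')             # requires pyarrow or fastparquet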