
Python Dask Running Bag operations in parallel

I am trying to run a series of operations on a JSON file using Dask and read_text, but when I check the Linux System Monitor I find that only one core is ever used, at 100%. How do I know whether the operations I am performing on a Dask Bag can be parallelized? Here is the basic layout of what I am doing:

import dask.bag as db
import json

# cleantext, stopwords, tbl, stop, and exclusion are defined elsewhere
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords, stop=stop).compute()

The basic premise is to extract the text entries from the JSON file and then perform some cleaning operations. This seems like something that can be parallelized, since each piece of text could be handed off to a separate processor: each text, and the cleaning of it, is independent of all the others. Is this thinking incorrect? Is there something I should be doing differently?
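
As a sanity check, one way to tell whether the steps fan out at all is to time the same computation under Dask's single-threaded scheduler and under the default one. Below is a cut-down sketch of the pipeline above; the scheduler= keyword assumes a reasonably recent Dask release:

import json
import time

import dask.bag as db

js = db.read_text('path/to/json').map(json.loads)
matches = js.filter(lambda d: d['field'] == 'value').count()

# Single-threaded baseline vs. the default scheduler (multiprocessing
# for bags). If both runs take about the same time, the work is not
# being spread across cores.
t0 = time.time()
matches.compute(scheduler='synchronous')
print('serial :', time.time() - t0)

t0 = time.time()
matches.compute()
print('default:', time.time() - t0)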

Thanks.

asked Feb 03 '26 by Billiam

1 Answer

The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.

>>> js.npartitions
1

If so, then you might consider reducing the blocksize to increase the number of partitions:

>>> js = db.read_text(..., blocksize=1e6)  # 1MB chunks
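
Putting it together, a minimal sketch (the path and the 1 MB blocksize are illustrative values; recent Dask versions also accept a string such as blocksize='1MB'). Since dask.bag uses the multiprocessing scheduler by default, once the bag has several partitions the map/filter/pluck steps should spread across cores:

import json

import dask.bag as db

# A smaller blocksize means more partitions, hence more parallelism.
js = db.read_text('path/to/json', blocksize=int(1e6)).map(json.loads)
print(js.npartitions)  # should now be > 1 for a multi-megabyte file

# Each partition's work can run on a separate core.
result = js.filter(lambda d: d['field'] == 'value').pluck('field')
print(result.take(3))  # computes a small sample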
answered Feb 05 '26 by MRocklin


