
Python Dask Running Bag operations in parallel

I am trying to run a series of operations on a JSON file using Dask and read_text, but when I check the Linux System Monitor I find that only one core is ever used, at 100%. How do I know whether the operations I am performing on a Dask Bag can be parallelized? Here is the basic layout of what I am doing:

import dask.bag as db
import json

# cleantext, stopwords, tbl, stop, and exclusion are defined elsewhere
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords, stop=stop).compute()

The basic premise is to extract the text entries from the JSON file and then perform some cleaning operations. This seems like something that can be parallelized, since each piece of text could be handed off to a separate processor: each text, and the cleaning of it, is independent of all the others. Is this thinking incorrect? Is there something I should be doing differently?
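
As a sanity check, one way to tell whether the steps fan out at all is to time the same computation under Dask's single-threaded scheduler and under the default one. Below is a cut-down sketch of the pipeline above; the scheduler= keyword assumes a reasonably recent Dask release:

import json
import time

import dask.bag as db

js = db.read_text('path/to/json').map(json.loads)
matches = js.filter(lambda d: d['field'] == 'value').count()

# Single-threaded baseline vs. the default scheduler (multiprocessing
# for bags). If both runs take about the same time, the work is not
# being spread across cores.
t0 = time.time()
matches.compute(scheduler='synchronous')
print('serial :', time.time() - t0)

t0 = time.time()
matches.compute()
print('default:', time.time() - t0)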

Thanks.

asked Feb 03 '26 by Billiam

1 Answer

The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.

>>> js.npartitions
1

If so, then you might consider reducing the blocksize to increase the number of partitions:

>>> js = db.read_text(..., blocksize=1e6)  # 1MB chunks
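
Putting it together, a minimal sketch (the path and the 1 MB blocksize are illustrative values; recent Dask versions also accept a string such as blocksize='1MB'). Since dask.bag uses the multiprocessing scheduler by default, once the bag has several partitions the map/filter/pluck steps should spread across cores:

import json

import dask.bag as db

# A smaller blocksize means more partitions, hence more parallelism.
js = db.read_text('path/to/json', blocksize=int(1e6)).map(json.loads)
print(js.npartitions)  # should now be > 1 for a multi-megabyte file

# Each partition's work can run on a separate core.
result = js.filter(lambda d: d['field'] == 'value').pluck('field')
print(result.take(3))  # computes a small sample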
answered Feb 05 '26 by MRocklin


