
Using all cores in Dask

Tags:

dask

I am working on a Google Cloud compute instance with 24 vCPUs. The code I am running is the following:

import dask.dataframe as dd
from dask.distributed import Client

# start a local cluster with the default settings
client = Client()

# read the data, index it by request id, and convert object columns to categoricals
logd = (dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'])
         .set_index('idHttp')
         .rename(columns={'User Agent Type': 'UA'})
         .categorize())

When I run it (and the same is true for the subsequent data analysis I do after loading the data), I see 11 cores being used, sometimes only 4.

Is there any way to control this better and make full use of cores?

asked Nov 09 '22 by JuanPabloMF


1 Answer

read_csv splits your files into chunks according to the blocksize parameter, with at least one chunk per input file. You are reading only one file, and it seems you are getting four partitions (i.e., the file is smaller than 4 * 64MB, the default block size). That may be reasonable for this amount of data; splitting it into many small tasks would likely only add overhead.
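
For example, you can check how many partitions you actually ended up with before doing any further work (a minimal sketch; the file path is the one from your snippet):

import dask.dataframe as dd

df = dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'])
print(df.npartitions)  # number of chunks; this caps how many tasks can run at once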

Nevertheless, you could change the blocksize parameter and see what difference it makes for you, or see what happens when you pass multiple files, e.g., read_csv('vol/*test'). Alternatively, you can control the partitioning in the call to set_index.
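
A rough sketch of those two knobs (the 16MB blocksize and npartitions=24 are illustrative values for a 24-vCPU machine, not numbers derived from your data):

import dask.dataframe as dd

# smaller blocks -> more partitions -> more tasks that can run in parallel
logd = (dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'],
                    blocksize='16MB')          # instead of the 64MB default
        .set_index('idHttp', npartitions=24)   # or control partitioning at the shuffle
        .rename(columns={'User Agent Type': 'UA'})
        .categorize())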

answered Dec 26 '22 by mdurant