I am working on a Google Cloud compute instance with 24 vCPUs. The code I am running is the following:
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

# read the data
logd = (dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'])
        .set_index('idHttp')
        .rename(columns={'User Agent Type': 'UA'})
        .categorize())
When I run it (and this also applies to the subsequent data analysis I do after loading the data), I see 11 cores being used, sometimes only 4.
Is there any way to control this better and make full use of cores?
read_csv will split your files into chunks according to the blocksize parameter (64 MB by default), with at least one chunk per input file. You are reading only one file, and it seems you are getting four partitions (i.e., the file is smaller than 4 * 64 MB). This may be reasonable for the amount of data, and the extra parallelism of many small tasks might only add overhead.
Nevertheless, you could change the blocksize parameter and see what difference it makes for you, or see what happens when you pass multiple files, e.g., read_csv('vol/*test').
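For instance, a minimal sketch of the first option (the blocksize value here is illustrative, not a recommendation):

import dask.dataframe as dd

# A smaller blocksize yields more, smaller partitions (default is ~64 MB).
logd = dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'],
                   blocksize='16MB')  # illustrative value
print(logd.npartitions)  # roughly file_size / blocksize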
Alternatively, you can set the partitioning in the call to set_index.
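For example, a sketch assuming you want roughly one partition per vCPU (npartitions=24 is an assumption to match your machine, not something from the question):

logd = (dd.read_csv('vol/800000test', sep='\t', parse_dates=['Date'])
        .set_index('idHttp', npartitions=24)  # one partition per vCPU (assumed)
        .rename(columns={'User Agent Type': 'UA'})
        .categorize())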