
What is the default directory where Dask workers store results or files?

[mapr@impetus-i0057 latest_code_deepak]$ dask-worker 172.26.32.37:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.26.32.36:50930'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/mapr/latest_code_deepak/dask-worker-space/worker-PwEseH', purging
distributed.worker - INFO -       Start worker at:   tcp://172.26.32.36:41694
distributed.worker - INFO -          Listening to:   tcp://172.26.32.36:41694
distributed.worker - INFO -              bokeh at:          172.26.32.36:8789
distributed.worker - INFO -              nanny at:         172.26.32.36:50930
distributed.worker - INFO - Waiting to connect to:    tcp://172.26.32.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   33.52 GB
distributed.worker - INFO -       Local Directory: /home/mapr/latest_code_deepak/dask-worker-space/worker-AkBPtM
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://172.26.32.37:8786
distributed.worker - INFO - -------------------------------------------------

What is the default directory where a dask-worker keeps temporary files, such as task results, or files that were uploaded from the client using the upload_file() method?

For example:

def my_task_running_on_dask_worker():
    # fetch the file from HDFS
    # process the file
    # store the file back into HDFS
TheCodeCache asked Feb 07 '18 06:02

People also ask

What are workers in Dask?

In dask-distributed, a Worker is a Python object and node in a Dask cluster that serves two purposes: 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.

How does Dask manage memory?

Dask.distributed stores the results of tasks in the distributed memory of the worker nodes. The central scheduler tracks all data on the cluster and determines when data should be freed. Completed results are usually cleared from memory as quickly as possible in order to make room for more computation.

How many workers does Dask have?

If we start Dask using processes — as in the following code — we get 8 workers, one for each core, with each worker allotted 2 GB of memory (16 GB total / 8 workers, this will vary depending on your laptop).

How do I access the Dask dashboard?

After spinning up a Dask cluster, you can use client.dashboard_link to get a link to your dashboard. If you're using the distributed scheduler for local computation, the dashboard will be served at localhost:8787. This dashboard shows the real-time status of your cluster, resources, and computations.


1 Answer

By default a dask worker creates a directory at ./dask-worker-space/worker-######, where ###### is a random string unique to that particular worker.

You can change this location with the --local-directory option of the dask-worker executable.
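For example, to point the worker space at a different path (the scheduler address matches the question's logs; the directory below is illustrative):

```shell
dask-worker 172.26.32.37:8786 --local-directory /tmp/my-dask-space
```

The worker will then create its /tmp/my-dask-space/worker-###### directory there instead of under the current working directory.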

The warning that you're seeing in this line

distributed.diskutils - WARNING - Found stale lock file and directory '/home/mapr/latest_code_deepak/dask-worker-space/worker-PwEseH', purging

says that a Dask worker noticed that the directory for another worker wasn't cleaned up, presumably because that worker died without shutting down cleanly. This worker is cleaning up the space left behind by the previous one.

Edit

You can see which worker creates which directory either by looking at the logs of each worker (They print out their local directory)

$ dask-worker localhost:8786
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:36607
...
distributed.worker - INFO -       Local Directory: /home/mrocklin/dask-worker-space/worker-ks3mljzt

Or programmatically by calling client.scheduler_info()

>>> client.scheduler_info()
{'address': 'tcp://127.0.0.1:34027',
 'id': 'Scheduler-bd88dfdf-e3f7-4b39-8814-beae779248f1',
 'services': {'bokeh': 8787},
 'type': 'Scheduler',
 'workers': {'tcp://127.0.0.1:33143': {'cpu': 7.7,
    ... 
   'local_directory': '/home/mrocklin/dask-worker-space/worker-8kvk_l81',
  },
...
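From that dictionary you can collect each worker's local directory in one pass. The sketch below runs against a sample dict shaped like the output above; the addresses and paths are copied from the example, not from a live cluster.

```python
def worker_local_directories(scheduler_info):
    """Map each worker address to its local_directory entry."""
    return {
        addr: info["local_directory"]
        for addr, info in scheduler_info["workers"].items()
    }

# Illustrative sample mirroring the shape of client.scheduler_info()
sample = {
    "address": "tcp://127.0.0.1:34027",
    "workers": {
        "tcp://127.0.0.1:33143": {
            "cpu": 7.7,
            "local_directory": "/home/mrocklin/dask-worker-space/worker-8kvk_l81",
        },
    },
}

print(worker_local_directories(sample))
```

This is handy when you need to know, per worker, where files shipped with upload_file() ended up.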
MRocklin answered Jan 03 '23 07:01