
What is the default directory where Dask workers store results or files?

[mapr@impetus-i0057 latest_code_deepak]$ dask-worker 172.26.32.37:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://172.26.32.36:50930'
distributed.diskutils - WARNING - Found stale lock file and directory '/home/mapr/latest_code_deepak/dask-worker-space/worker-PwEseH', purging
distributed.worker - INFO -       Start worker at:   tcp://172.26.32.36:41694
distributed.worker - INFO -          Listening to:   tcp://172.26.32.36:41694
distributed.worker - INFO -              bokeh at:          172.26.32.36:8789
distributed.worker - INFO -              nanny at:         172.26.32.36:50930
distributed.worker - INFO - Waiting to connect to:    tcp://172.26.32.37:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   33.52 GB
distributed.worker - INFO -       Local Directory: /home/mapr/latest_code_deepak/dask-worker-space/worker-AkBPtM
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to:    tcp://172.26.32.37:8786
distributed.worker - INFO - -------------------------------------------------

What is the default directory where a dask-worker keeps temporary files, such as task results, or files that were uploaded from the client using the upload_file() method?

For example:

def my_task_running_on_dask_worker():
    # fetch the file from HDFS
    # process the file
    # store the file back into HDFS
TheCodeCache asked Feb 07 '18 06:02

People also ask

What are workers in Dask?

In dask-distributed, a Worker is a Python object and node in a Dask cluster that serves two purposes: 1) serve data, and 2) perform computations. Jobs are resources submitted to, and managed by, the job queueing system (e.g. PBS, SGE, etc.). In dask-jobqueue, a single Job may include one or more Workers.

How does Dask manage memory?

Dask.distributed stores the results of tasks in the distributed memory of the worker nodes. The central scheduler tracks all data on the cluster and determines when data should be freed. Completed results are usually cleared from memory as quickly as possible in order to make room for more computation.

How many workers does Dask have?

If we start Dask using processes — as in the following code — we get 8 workers, one for each core, with each worker allotted 2 GB of memory (16 GB total / 8 workers, this will vary depending on your laptop).

How do I access the Dask dashboard?

After spinning up a Dask cluster, you can use client.dashboard_link to get a link to your dashboard. If you're using the distributed scheduler for local computation, the dashboard will be served at localhost:8787. This dashboard shows the real-time status of your cluster, resources, and computations.


1 Answer

By default a dask worker creates a directory at ./dask-worker-space/worker-######, where ###### is a random string unique to that particular worker.

You can change this location with the --local-directory option of the dask-worker executable.
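For example, to point the worker space at a different path (the scheduler address matches the question's logs; the directory below is illustrative):

```shell
dask-worker 172.26.32.37:8786 --local-directory /tmp/my-dask-space
```

The worker will then create its /tmp/my-dask-space/worker-###### directory there instead of under the current working directory.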

The warning that you're seeing in this line

distributed.diskutils - WARNING - Found stale lock file and directory '/home/mapr/latest_code_deepak/dask-worker-space/worker-PwEseH', purging

says that a Dask worker noticed that the directory for another worker wasn't cleaned up, presumably because that worker died without shutting down cleanly. This worker is cleaning up the space left behind by the previous one.

Edit

You can see which worker creates which directory either by looking at the logs of each worker (They print out their local directory)

$ dask-worker localhost:8786
distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:36607
...
distributed.worker - INFO -       Local Directory: /home/mrocklin/dask-worker-space/worker-ks3mljzt

Or programmatically by calling client.scheduler_info()

>>> client.scheduler_info()
{'address': 'tcp://127.0.0.1:34027',
 'id': 'Scheduler-bd88dfdf-e3f7-4b39-8814-beae779248f1',
 'services': {'bokeh': 8787},
 'type': 'Scheduler',
 'workers': {'tcp://127.0.0.1:33143': {'cpu': 7.7,
    ... 
   'local_directory': '/home/mrocklin/dask-worker-space/worker-8kvk_l81',
  },
...
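From that dictionary you can collect each worker's local directory in one pass. The sketch below runs against a sample dict shaped like the output above; the addresses and paths are copied from the example, not from a live cluster.

```python
def worker_local_directories(scheduler_info):
    """Map each worker address to its local_directory entry."""
    return {
        addr: info["local_directory"]
        for addr, info in scheduler_info["workers"].items()
    }

# Illustrative sample mirroring the shape of client.scheduler_info()
sample = {
    "address": "tcp://127.0.0.1:34027",
    "workers": {
        "tcp://127.0.0.1:33143": {
            "cpu": 7.7,
            "local_directory": "/home/mrocklin/dask-worker-space/worker-8kvk_l81",
        },
    },
}

print(worker_local_directories(sample))
```

This is handy when you need to know, per worker, where files shipped with upload_file() ended up.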
MRocklin answered Jan 03 '23 07:01