Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Threads in Hadoop

Am confused in using few parameters in hdfs-site.xml,

dfs.namenode.handler.count - The number of server threads for the namenode. dfs.datanode.handler.count - The number of server threads for the datanode. dfs.datanode.max.transfer.threads - Specifies the maximum number of threads to use for transferring data in and out of the DN.

I have set the 'default datanode handler' count as '10', where as 'dfs.datanode.max.transfer.threads' is set as '4096'.

 lscpu

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
Stepping:              4
CPU MHz:               2000.000
BogoMIPS:              4000.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0,1

My confusion is

1) I have 2 CPU's, As per my understanding my system will be able to serve 2 threads at a time, what is the use in setting the datanode / namenode handler to higher value like '10'?

2) What is the difference between handler count and maximum transfer thread both are used for processing?

Thanks, Harry

like image 437
Harry Avatar asked Aug 06 '16 07:08

Harry


1 Answers

1) I have 2 CPU's, As per my understanding my system will be able to serve 2 threads at a time, what is the use in setting the datanode / namenode handler to higher value like '10'?

Most of the time, these threads will be blocked(asleep) waiting IO operation. Assume on average 1 thread is asleep for 99.9% of the time, then it only consumes 0.1% cpu. You can easily run 1000 threads at the same time. In production, the threads configuration should be based on the cluster setup (pysical cores per node, disk throughput, network throughput, workload, etc.) If you are not sure, just use the default values.

2) What is the difference between handler count and maximum transfer thread both are used for processing?

dfs.datanode.handler.count is the handler threads for ClientDatanodeProtocol, which is used for client/DN RPC communicates information about block recovery meta info. The message size is small and the transfer is fast, the handler will be idle for most of the time, so we don't need much handlers. We can easily reuse the idle one. So the default value is 10 which is quite smaller than transfer.threads.

dfs.datanode.max.transfer.threads is the number of DataXceiver threads, which is used for transfering blocks via the DTP (data transfer protocol). The block data is big and the transfer takes some time. 1 thread will be served for one block reading. only until the whole block is transferred, the thread can be reused. If there's many clients request block at the same time, we need more threads. For each write connection, there will be 2 threads. So this number should be larger for write bound applications.

Actually DataXceiver threads will be blocked waiting for reading from disk or waiting for sending data through interface. So it doesn't consume much cpu except data checksums computation.

like image 73
waltersu Avatar answered Sep 18 '22 05:09

waltersu