I have a file a.txt on my local machine, and I want to move that file into the Hadoop file system (HDFS) as follows:
hadoop fs -put a.txt /user/hive/warehouse/sample_db/sample_table/
What happens in the background when the file a.txt moves from the local machine to the HDFS location?
In the background, the source file is split into HDFS blocks, whose size is configurable (64 MB by default in Hadoop 1.x, 128 MB by default in Hadoop 2.x and later). For fault tolerance, each block is automatically replicated by HDFS: by default, three copies of each block are written to three different DataNodes, and this replication factor is also user-configurable. The DataNodes are servers, either physical machines or virtual machines/cloud instances. The DataNodes form the Hadoop cluster into which you write your data and on which you run your MapReduce/Hive/Pig/Impala/Mahout/etc. programs.
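If you want to verify this yourself, the standard HDFS shell can show the configured defaults and, after the put, how the file was actually split into blocks and where the replicas landed. A rough sketch, reusing the path from the question:

hdfs getconf -confKey dfs.blocksize       # configured block size in bytes, e.g. 134217728 (128 MB)
hdfs getconf -confKey dfs.replication     # configured replication factor, e.g. 3
hdfs fsck /user/hive/warehouse/sample_db/sample_table/a.txt -files -blocks -locations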
The DataNodes are the workers of the Hadoop cluster; the NameNode is the master.
When a file is to be written into HDFS, the client writing the file obtains from the NameNode a list of DataNodes that can host replicas of the first block of the file.
The client arranges a pipeline through which all bytes of data from the first block of the source file will be transmitted to all participating DataNodes. The pipeline is formed from client to first DataNode to second DataNode to final (third in our case) DataNode. The data is split into packets for transmission, and each packet is tracked until all DataNodes return acks to indicate successful replication of the data. The packets are streamed to the first DataNode in the pipeline, which stores the packet and forwards it to the second DataNode, and so on. If one or more replications fail, the infrastructure automatically constructs a new pipeline and retries the copy.
When all three DataNodes confirm successful replication, the client will advance to the next block, again request a list of host DataNodes from the NameNode, and construct a new pipeline. This process is followed until all blocks have been copied into HDFS. The final block written may be smaller than the configured block size, but all blocks from the first to the penultimate block will be of the configured block size.
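For example, with the 128 MB default block size, a 300 MB a.txt would be written as three blocks of 128 MB, 128 MB, and 44 MB, and with a replication factor of 3 each of those blocks would end up on three different DataNodes, giving nine block replicas in total.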
Reference: Hadoop: The Definitive Guide by Tom White.
hadoop fs -put
does not move the file from local to Hadoop; it just copies the file to HDFS.
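If you actually want move semantics, where the local copy is deleted once the upload succeeds, the shell has a separate command for that; for example:

hadoop fs -moveFromLocal a.txt /user/hive/warehouse/sample_db/sample_table/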
When you fire the hadoop fs -put
command, Hadoop copies that file to the DataNodes in the form of blocks, and the block size is picked from the Hadoop configuration.
You can specify the block size at the time of copying the file using the -D option, which lets you override Hadoop properties for that particular copy command.
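For example, something like the following should copy a.txt with a 64 MB block size and a replication factor of 2 for this one command only (the values are purely illustrative):

hadoop fs -D dfs.blocksize=67108864 -D dfs.replication=2 -put a.txt /user/hive/warehouse/sample_db/sample_table/

In older Hadoop versions the block size property is named dfs.block.size instead of dfs.blocksize.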