
Why is there no 'hadoop fs -head' shell command?

Tags:

hadoop

hdfs

A fast method for inspecting files on HDFS is to use tail:

~$ hadoop fs -tail /path/to/file 

This displays the last kilobyte of data in the file, which is extremely helpful. However, the opposite command, head, does not appear to be part of the shell command collection. I find this very surprising.

My hypothesis is that since HDFS is built for very fast streaming reads of very large files, there is some access-oriented constraint that makes head expensive, and this makes me hesitant to roll my own way of reading the head of a file. Does anyone have an answer?

asked Nov 04 '13 by bbengfort


1 Answer

I would say it's more to do with efficiency - a head can easily be replicated by piping the output of hadoop fs -cat through the Linux head command.

hadoop fs -cat /path/to/file | head 

This is efficient because head will close the underlying stream once the desired number of lines has been output - the upstream cat receives a SIGPIPE and stops reading from HDFS, so only the first block or so is ever transferred.
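You can see this early-termination behavior without Hadoop at all. A minimal sketch, using seq as a stand-in for a large streamed file (this is an illustration of the pipe mechanics, not an HDFS command):

```shell
# head reads 5 lines, then closes its input. The producer (seq) is
# terminated by SIGPIPE on its next write, so the remaining ~million
# lines are never generated - the same reason `hadoop fs -cat | head`
# only pulls the first HDFS block(s).
seq 1 1000000 | head -n 5
```

This prints the numbers 1 through 5 and returns almost instantly, even though the producer nominally had a million lines to emit.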

Using tail in this manner would be considerably less efficient - you'd have to stream over the entire file (all HDFS blocks) to find the final N lines.

hadoop fs -cat /path/to/file | tail 

The hadoop fs -tail command, as you note, works on the last kilobyte - hadoop can efficiently seek to the last block, skip to the position of the final kilobyte, then stream the output. Piping via tail can't easily do this, because a pipe is not seekable.
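The same seekable-vs-stream distinction is visible with plain Linux tail. A sketch, using a hypothetical scratch file /tmp/bigfile.txt (any large local file would do):

```shell
# Build a large local file to stand in for an HDFS file.
seq 1 1000000 > /tmp/bigfile.txt

# Given a regular file, tail seeks near end-of-file and reads only
# the last few bytes - analogous to `hadoop fs -tail` jumping to the
# last block.
tail -n 2 /tmp/bigfile.txt

# Given a pipe, tail has no way to seek, so it must consume all
# million lines to know which two came last - analogous to
# `hadoop fs -cat | tail` streaming every HDFS block.
seq 1 1000000 | tail -n 2
```

Both commands print 999999 and 1000000, but the first touches only the tail of the file while the second reads everything.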

answered Sep 20 '22 by Chris White