HDFS put vs. WebHDFS

Tags:

hadoop

hdfs

webhdfs

I'm loading a 28 GB file into Hadoop HDFS using WebHDFS, and it takes ~25 minutes to load.

I tried loading the same file using hdfs put, and it took ~6 minutes. Why is there such a big difference in performance?

Which one is recommended? If somebody could explain or point me to a good link, it would be really helpful.

Below is the command I'm using:

curl -i --negotiate -u: -X PUT "http://$hostname:$port/webhdfs/v1/$destination_file_location/$source_filename.temp?op=CREATE&overwrite=true"

This redirects to a DataNode address, which I use in the next step to write the data.
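For reference, the two-step flow described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names `extract_location` and `webhdfs_put` and their arguments are hypothetical, and the path is assumed to be given without a leading slash. The first request sends no data and only collects the redirect; the second streams the file to the address from the `Location` header.

```shell
#!/usr/bin/env bash
# Sketch of the two-step WebHDFS CREATE flow.
# Function and variable names are illustrative, not from the original post.

# Read raw HTTP response headers on stdin and print the first Location value.
extract_location() {
  # Header names are case-insensitive; HTTP lines end in CR, so strip it.
  awk 'tolower($1) == "location:" { sub(/\r$/, "", $2); print $2; exit }'
}

# Usage: webhdfs_put <namenode_host:port> <hdfs_path_without_leading_slash> <local_file>
webhdfs_put() {
  local namenode="$1" hdfs_path="$2" local_file="$3"
  local location

  # Step 1: PUT with no body; the NameNode replies with a redirect
  # whose Location header points at a DataNode.
  location=$(curl -s -i --negotiate -u: -X PUT \
    "http://${namenode}/webhdfs/v1/${hdfs_path}?op=CREATE&overwrite=true" \
    | extract_location)
  [ -n "$location" ] || { echo "no redirect received" >&2; return 1; }

  # Step 2: upload the file body to the DataNode URL from step 1.
  # -T reads the file from disk as it uploads.
  curl -s -i --negotiate -u: -X PUT -T "$local_file" "$location"
}
```

Wrapping both commands in `time` (or timing `hdfs dfs -put` the same way) is a simple way to reproduce the comparison in the question on a given cluster.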


asked Jul 23 '15 07:07

chhaya vishwakarma

1 Answer

Hadoop provides several ways of accessing HDFS.

All of the following support almost all features of the filesystem:

1. FileSystem (FS) shell commands: These provide easy access to Hadoop file system operations, as well as to the other file systems that Hadoop supports, such as Local FS, HFTP FS, and S3 FS.
This requires a Hadoop client to be installed, and the client writes blocks directly to a DataNode. Not all versions of Hadoop support all options for copying between filesystems.

2. WebHDFS: It defines a public HTTP REST API, which lets clients access Hadoop from multiple languages without installing Hadoop. The advantage is that it is language-agnostic (curl, PHP, etc.).
WebHDFS needs access to all nodes of the cluster, and when data is read it is transmitted directly from the source node, but **there is HTTP overhead** compared with (1) the FS shell. In exchange, it works agnostically and has no problems across different Hadoop clusters and versions.

3. HttpFS: Reads and writes data to HDFS in a cluster behind a firewall. A single node acts as a gateway through which all the data is transferred. Performance-wise, I believe this can be even slower, but it is preferred when you need to pull data from a public source into a secured cluster.

So choose wisely! Each option further down the list is the fallback when the one above it is not available to you.
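As a rough illustration of how the three options look from the client side, here is a minimal sketch. The helper `rest_url`, the host names, and the file paths are hypothetical. The ports are assumptions based on common defaults (50070 for the NameNode HTTP port in Hadoop 2.x, 14000 for HttpFS), so check your cluster's configuration.

```shell
#!/usr/bin/env bash
# Illustrative only: hosts, paths, and the rest_url helper are made up.

# Build a WebHDFS-style REST URL; WebHDFS and HttpFS share the /webhdfs/v1 prefix.
rest_url() {
  local host="$1" port="$2" hdfs_path="$3" op="$4"
  # Drop one leading slash so the path joins cleanly after /webhdfs/v1/.
  printf 'http://%s:%s/webhdfs/v1/%s?op=%s\n' "$host" "$port" "${hdfs_path#/}" "$op"
}

# 1. FS shell: needs a Hadoop client installed, so it is left commented out here.
# hdfs dfs -put -f ./data.csv /user/alice/data.csv

# 2. WebHDFS: the request goes to the NameNode, which redirects to a DataNode.
rest_url namenode.example.com 50070 /user/alice/data.csv CREATE

# 3. HttpFS: same REST path, but all traffic passes through one gateway host.
rest_url httpfs.example.com 14000 /user/alice/data.csv CREATE
```

The only difference between options 2 and 3 at the URL level is the endpoint. The performance difference comes from where the bytes travel: straight to a DataNode with WebHDFS, or through the single gateway with HttpFS.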


answered Sep 27 '22 15:09

rbyndoor
