I'm new to Hadoop. Recently I've been trying to process (read only) many small files on HDFS/Hadoop. The average file size is about 1 KB and there are more than 10 million files. The program must be written in C++ due to some constraints.
This is just a performance evaluation, so I'm only using 5 machines as data nodes. Each data node has 5 data disks.
I wrote a small C++ program that reads the files directly from the local disks (not from HDFS) to establish a performance baseline. The program creates 4 reading threads per disk and achieves about 14 MB/s per disk, so the total throughput is about 14 MB/s * 5 disks * 5 machines = 350 MB/s.
However, when the same program (still C++, dynamically linked against libhdfs.so, creating 4 * 5 * 5 = 100 threads) reads the files from the HDFS cluster, the throughput is only about 55 MB/s.
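Roughly, each reader thread does something like the following (a simplified sketch of the libhdfs usage; error handling and thread setup are omitted, and the buffer size is just an example):

```cpp
#include <hdfs.h>    // libhdfs C API
#include <fcntl.h>   // O_RDONLY
#include <string>
#include <vector>

// Simplified sketch of one reader thread: open each assigned HDFS file,
// read it to the end, and discard the data (only throughput is measured).
void readFiles(hdfsFS fs, const std::vector<std::string>& paths) {
    std::vector<char> buf(64 * 1024);                 // 64 KB read buffer
    for (const std::string& p : paths) {
        hdfsFile f = hdfsOpenFile(fs, p.c_str(), O_RDONLY, 0, 0, 0);
        if (!f) continue;                             // real code logs the error
        tSize n;
        while ((n = hdfsRead(fs, f, buf.data(), buf.size())) > 0) {
            // data is discarded; only bytes/second matter here
        }
        hdfsCloseFile(fs, f);
    }
}

// Each of the 100 threads gets its own connection and a slice of the file list:
//   hdfsFS fs = hdfsConnect("default", 0);
//   readFiles(fs, myShareOfPaths);
//   hdfsDisconnect(fs);
```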
If this program is launched through MapReduce (Hadoop streaming, 5 jobs, each with 20 threads, so the total is still 100 threads), the throughput drops further to about 45 MB/s. (I guess it is slowed down by some bookkeeping overhead.)
I'm wondering what performance HDFS can reasonably provide. As you can see, compared with the native code, the throughput is only about 1/7. Is this a problem with my configuration? An HDFS limitation? A Java limitation? What's the best approach for my scenario? Would SequenceFiles help (much)? What throughput, compared to native I/O reads, can we reasonably expect?
Here's some of my config:
NameNode heap size 32G.
Job/Task node heap size 8G.
NameNode Handler Count: 128
DataNode Handler Count: 8
DataNode Maximum Number of Transfer Threads: 4096
1 Gbps Ethernet.
Thanks.
HDFS is not geared up for efficiently accessing small files: it is primarily designed for streaming access to large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern.
A small file is one that is significantly smaller than the default HDFS block size. The problem is that each small file occupies a block (and a namenode object) of its own, which leads to excessive memory requirements on the namenode as well as longer access and processing times.
You can write a MapReduce program to convert the many small files into a single SequenceFile. SequenceFiles are splittable, so MapReduce can break them into chunks and operate on each chunk independently. They also support block compression, which is usually the best option.
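As a rough illustration of the gain: 10 million files of about 1 KB each is only about 10 GB of data in total. Packed into SequenceFiles with, say, a 128 MB block size, that is on the order of 80 blocks for MapReduce to schedule, instead of 10 million separate files and namenode objects.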
HDFS is really not designed for many small files.
For each new file you read, the client has to talk to the namenode, which gives it the location(s) of the block(s) of the file, and then the client streams the data from the datanode.
Now, in the best case, the client does this once, finds that it is running on the machine that holds the data, and can read it directly from disk. This will be fast: comparable to direct disk reads.
If it's not the machine that has the data on it, then it must stream the data over the network. Then you are bound by network I/O speeds, which shouldn't be terrible, but still a bit slower than direct disk read.
However, you're getting an even worse case, where the overhead of talking to the namenode becomes significant. With only 1 KB files, you're at the point where you're exchanging just as much metadata as actual data. The client has to make two separate network exchanges to get the data from each file. Add to this that the namenode is probably getting hammered by all of these different threads, so it might become a bottleneck.
So to answer your question, yes, if you use HDFS for something it's not designed to be used for, it's going to be slow. Merge your small files, and use MapReduce to get data locality, and you'll have much better performance. In fact, because you'll be able to take better advantage of sequential disk reads, I wouldn't be surprised if reading from one big HDFS file was even faster than reading many small local files.
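Since the question is limited to C++ (the SequenceFile writer is a Java API), here is a minimal sketch of the "merge small files into big HDFS files" idea using libhdfs directly. It writes a trivial length-prefixed record format rather than a real SequenceFile, and all error handling is omitted:

```cpp
#include <hdfs.h>    // libhdfs C API
#include <fcntl.h>   // O_WRONLY
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Illustration only: pack many small local files into one large HDFS file
// using a simple [8-byte length][payload] record format (host byte order).
// This is NOT a SequenceFile; it just shows the consolidation idea from C++.
bool packFiles(hdfsFS fs, const std::vector<std::string>& localPaths,
               const char* hdfsPath) {
    hdfsFile out = hdfsOpenFile(fs, hdfsPath, O_WRONLY, 0, 0, 0);
    if (!out) return false;

    std::vector<char> buf;
    for (const std::string& p : localPaths) {
        std::FILE* in = std::fopen(p.c_str(), "rb");
        if (!in) continue;                            // real code reports the error
        std::fseek(in, 0, SEEK_END);
        long size = std::ftell(in);
        std::rewind(in);
        if (size <= 0) { std::fclose(in); continue; } // skip empty/unreadable files
        buf.resize(static_cast<size_t>(size));
        uint64_t len = std::fread(buf.data(), 1, buf.size(), in);
        std::fclose(in);

        hdfsWrite(fs, out, &len, sizeof(len));                    // record header
        hdfsWrite(fs, out, buf.data(), static_cast<tSize>(len));  // record payload
    }
    hdfsCloseFile(fs, out);
    return true;
}
```

The reader side would then stream each large file sequentially and split it back into records, which turns millions of tiny reads into a handful of large sequential ones.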
Just to add to what Joe has said: another difference between HDFS and other filesystems is that it keeps disk I/O as low as possible by storing data in large blocks (normally 64 MB or 128 MB), whereas traditional filesystems use block sizes on the order of KBs. That is why people say HDFS is good at processing a few large files rather than a large number of small files. The reason is that, although components like CPU and RAM have advanced significantly in recent years, disk I/O has not kept pace. The huge blocks (unlike a traditional FS) are intended to keep disk usage, and in particular seeking, to a minimum.
Moreover, if the block size is too small, we end up with a greater number of blocks, which means more metadata. This can again degrade performance, as more information has to be loaded into memory. Each block is treated as an object in HDFS and carries about 200 B of metadata. Many small blocks simply inflate the metadata, and you may end up with RAM issues on the namenode.
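As a rough back-of-the-envelope using that figure: the 10 million 1 KB files in the question each occupy their own block, so block metadata alone comes to roughly 10,000,000 * 200 B ≈ 2 GB of namenode heap, before even counting the file and directory objects.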
There is a very good post on the Cloudera blog that discusses this same issue.