 

Hadoop put performance - large file (20gb)

I'm using hdfs dfs -put to load a large 20GB file into HDFS. Currently the process takes about 4 minutes. I'm trying to improve the write time of loading data into HDFS. I tried using different block sizes to improve the write speed, but got the results below:

512M block size = 4 mins
256M block size = 4 mins
128M block size = 4 mins
64M block size = 4 mins

Does anyone know what the bottleneck could be, and what other options I could explore to improve the performance of the -put command?
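
For reference, a block size can be overridden per upload with the generic -D option of the HDFS shell; the property is dfs.blocksize on recent Hadoop versions (dfs.block.size on older ones), and the paths below are placeholders rather than the actual ones used:

    # Hypothetical example: upload one file with a 256MB block size override
    hdfs dfs -D dfs.blocksize=268435456 -put /local/data/bigfile.dat /user/hadoop/ingest/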

asked Oct 24 '13 by Irvo

1 Answer

20GB / 4 minutes comes out to about 85MB/sec. That's pretty reasonable throughput to expect from a single drive, given all the overhead of the HDFS protocol and the network. I'm betting that is your bottleneck. Without changing your ingest process, you're not going to be able to make this magically faster.

The core problem is that 20GB is a decent amount of data, and it's being pushed into HDFS as a single stream. You are limited by disk I/O, which is a shame given that you have a large number of disks in your Hadoop cluster. You've got a long way to go before you saturate a 10GigE network (and probably a 1GigE, too).

Changing the block size shouldn't change this behavior, as you saw. It's still the same amount of data coming off disk into HDFS through one stream.

I suggest you split the file up into 1GB files and spread them over multiple disks, then push them up with -put in parallel. You might even want to consider splitting these files over multiple nodes if the network becomes a bottleneck. Can you change the way you receive your data to make this faster? Obviously, splitting the file and moving it around will take time, too.
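
A minimal sketch of that approach, assuming GNU split and xargs are available; the paths and the parallelism of 4 are placeholders, not values from the question:

    # Split the 20GB file into 1GB chunks on local disk
    # (ideally write the chunks to several different physical disks).
    mkdir -p /local/data/chunks
    split -b 1G -d /local/data/bigfile.dat /local/data/chunks/bigfile.part.

    # Push the chunks into HDFS with up to 4 concurrent -put streams.
    ls /local/data/chunks/bigfile.part.* | xargs -P 4 -I {} hdfs dfs -put {} /user/hadoop/ingest/

The parallel -put streams only help once the reads come off more than one spindle (or more than one node), which is why spreading the chunks across disks matters as much as the parallelism itself.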

answered Sep 19 '22 by Donald Miner