Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Serve static files from Hadoop

My job is to design a distributed system for static image/video files. The size of the data is about tens of Terabytes. It's mostly for HTTP access (thus no processing on data; or only simple processing such as resizing- however it's not important because it can be done directly in the application).

To be a little more clear, it's a system that:

  1. Must be distributed (horizontal scale), because the total size of data is very big.
  2. Primarily serves small static files (such as images, thumbnails, short videos) via HTTP.
  3. Generally, no requirement on processing the data (thus MapReduce is not needed)
  4. Setting HTTP access on the data could be done easily.
  5. (Should have) good throughput.

I am considering:

  1. Native network file system: But it seems not feasible because the data can not fit into one machine.

  2. Hadoop filesystem. I worked with Hadoop mapreduce before, but I have no experience using Hadoop as a static file repository for HTTP requests. So I don't know if it's possible or if it's a recommended way.

  3. MogileFS. It seems promising, but I feel that using MySQL to manage local files (on a single machine) will create too much overhead.

Any suggestion please?

like image 288
hiro Avatar asked Dec 12 '22 14:12

hiro


1 Answers

I am the author of Weed-FS. For your requirement, WeedFS is ideal. Hadoop can not handle many small files, in addition to your reasons, each file needs to have an entry in the master. If the number of files are big, the hdfs master node can not scale.

Weed-FS is getting faster when compiled with latest Golang releases.

Many new improvements have been done on Weed-FS recently. Now you can test and compare very easily with the built-in upload tool. This one upload all files recursively under a directory.

weed upload -dir=/some/directory

Now you can compare by "du -k /some/directory" to see the disk usage, and "ls -l /your/weed/volume/directory" to see the Weed-FS disk usage.

And I suppose you would need replication with data center, rack aware, etc. They are in now!

like image 181
chrislusf Avatar answered Jan 08 '23 01:01

chrislusf