
Storing millions of log files - Approx 25 TB a year

As part of my work, we receive approximately 25 TB of log files annually. Currently they are saved on an NFS-based filesystem. Some are archived (zipped or tar.gz), while others reside in plain text format.

I am looking for alternatives to an NFS-based system. I looked at MongoDB and CouchDB. The fact that they are document-oriented databases seems to make them the right fit. However, the log file content would need to be converted to JSON to be stored in the DB, which is something I am not willing to do. I need to retain the log file content as is.

As for usage, we intend to put a small REST API in front of the store and allow people to get a file listing, get the latest files, and fetch an individual file.

The proposed solutions/ideas need to be some form of distributed database or filesystem at the application level, where one can store log files and scale horizontally by adding more machines.

Ankur

Ankur Gupta asked Oct 09 '10 05:10


3 Answers

Since you don't need querying features, you can use Apache Hadoop.

I believe HDFS and HBase will be a nice fit for this.

You can see a lot of large-scale storage stories on the Hadoop "Powered By" page.

RameshVel answered Sep 21 '22 07:09


Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15 GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad-core HP ProLiant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.

Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real-time querying of the logs. This might be really valuable for your ops team.

Jim Ferrans answered Sep 21 '22 07:09


Have you tried looking at Gluster? It is scalable and provides replication and many other features. It also gives you standard file operations, so there is no need to implement another API layer.

http://www.gluster.org/
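
To illustrate the point about standard file operations: assuming the volume is mounted at an ordinary path, the three operations from the question (file listing, latest files, fetching a file) can be sketched with nothing but Python's standard library. The function names and the mount point `/mnt/logs` are hypothetical and are not part of Gluster; this is a rough sketch, not a tuned implementation.

```python
from pathlib import Path


def list_logs(root):
    """Return every regular file under root as a sorted list of relative paths."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


def latest_logs(root, n=10):
    """Return the n most recently modified files under root, newest first."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.relative_to(root).as_posix() for p in files[:n]]


def read_log(root, rel_path):
    """Return the raw bytes of one file unchanged (plain text or tar.gz alike)."""
    root = Path(root).resolve()
    target = (root / rel_path).resolve()
    # Refuse paths that resolve outside the log root (e.g. "../etc/passwd").
    if root not in target.parents:
        raise ValueError("path escapes log root")
    return target.read_bytes()


if __name__ == "__main__":
    # "/mnt/logs" is an assumed mount point, used here only for illustration.
    for name in latest_logs("/mnt/logs", n=5):
        print(name)
```

A small REST layer could call these functions directly. With millions of files, walking the whole tree on every request would likely be too slow, so a directory layout keyed by date or a separate index of file names would probably be needed.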

Nauman answered Sep 21 '22 07:09
