Storing millions of log files - Approx 25 TB a year

Question

As part of my work we get approx 25TB worth log files annually, currently it been saved over an NFS based filesystem. Some are archived as in zipped/tar.gz while others reside in pure text format.

I am looking for alternatives of using an NFS based system. I looked at MongoDB, CouchDB. The fact that they are document oriented database seems to make it the right fit. However the log files content needs to be changed to JSON to be store into the DB. Something I am not willing to do. I need to retain the log files content as is.

As for usage we intend to put a small REST API and allow people to get file listing, latest files, and ability to get the file.

The proposed solutions/ideas need to be some form of distributed database or filesystem at application level where one can store log files and can scale horizontally effectively by adding more machines.

Ankur

RameshVel · Accepted Answer

Since you dont want queriying features, You can use apache hadoop.

I belive HDFS and HBase will be nice fit for this.

You can see lot of huge storage stories inside Hadoop powered by page

Jim Ferrans · Answer

Take a look at Vertica, a columnar database supporting parallel processing and fast queries. Comcast used it to analyze about 15GB/day of SNMP data, running at an average rate of 46,000 samples per second, using five quad core HP Proliant servers. I heard some Comcast operations folks rave about Vertica a few weeks ago; they still really like it. It has some nice data compression techniques and "k-safety redundancy", so they could dispense with a SAN.

Update: One of the main advantages of a scalable analytics database approach is that you can do some pretty sophisticated, quasi-real time querying of the log. This might be really valuable for your ops team.

Nauman · Answer

Have you tried looking at gluster? It is scalable, provides replication and many other features. It also gives you standard file operations so no need to implement another API layer.

http://www.gluster.org/

Storing millions of log files - Approx 25 TB a year

Tags:

mongodb

couchdb

storage

distributed

logfiles

Ankur Gupta

3 Answers

RameshVel

Jim Ferrans

Nauman

Recent Activity

Donate For Us

Storing millions of log files - Approx 25 TB a year

Tags:

mongodb

couchdb

storage

distributed

logfiles

Ankur Gupta

3 Answers

RameshVel

Jim Ferrans

Nauman

Related questions

Recent Activity

Donate For Us