I have a system which is receiving log files from different places through http (>10k producers, 10 logs per day, ~100 lines of text each). I would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ... My question is : what's the best way to store them ? <ul> <li>Flat text files (with proper locking), one file per uploaded file, one directory per day/producer</li> <li>Flat text files, one (big) file per day for all producers (problem here will be indexing and locking)</li> <li>Database Table with text (MySQL is preferred for internal reasons) (pb with DB purge as delete can be very long !)</li> <li>Database Table with one record per line of text</li> <li>Database with sharding (one table per day), allowing simple data purge. (this is partitioning. However the version of mysql I have access to (ie supported internally) does not support it)</li> <li>Document based DB à la couchdb or mongodb (problem could be with indexing / maturity / speed of ingestion)</li> </ul> Any advice ?

Since you would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ... You're expecting 100,000 files a day, at a total of 10,000,000 lines: I'd suggest: <ol> <li>Store all the files as regular textfiles using the following format : yyyymmdd/producerid/fileno.</li> <li>At the end of the day, clear the database, and load all the textfiles for the day.</li> <li>After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.</li> <li>To save space ,you could compress the daily folder. Since they're textfiles, they would compress well.</li> </ol> So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.

Storage of many log files

4 Answers

(Disclaimer: I work on MongoDB.)

I think MongoDB is the best solution for logging. It is blazingly fast, as in, it can probably insert data faster than you can send it. You can do interesting queries on the data (e.g., ranges of dates or log levels) and index and field or combination of fields. It's also nice because you can randomly add more fields to logs ("oops, we want a stack trace field for some of these") and it won't cause problems (as it would with flat text files).

As far as stability goes, a lot of people are already using MongoDB in production (see http://www.mongodb.org/display/DOCS/Production+Deployments). We just have a few more features we want to add before we go to 1.0.

110

answered Oct 21 '22 16:10

kristina

I'd pick the very first solution.

I don't see why would you need DB at all. Seems like all you need is to scan through the data. Keep the logs in the most "raw" state, then process it and then create a tarball for each day.

The only reason to aggregate would be to reduce the number of files. On some file systems, if you put more than N files in a directory, the performance decreases rapidly. Check your filesystem and if it's the case, organize a simple 2-level hierarchy, say, using the first 2 digits of producer ID as the first level directory name.

answered Oct 21 '22 17:10

Igor Krivokon

I would write one file per upload, and one directory/day as you first suggested. At the end of the day, run your processing over the files, and then tar.bz2 the directory.

The tarball will still be searchable, and will likely be quite small as logs can usually compress quite well.

For total data, you are talking about 1GB [corrected 10MB] a day uncompressed. This will likely compress to 100MB or less. I've seen 200x compression on my log files with bzip2. You could easily store the compressed data on a file system for years without any worries. For additional processing you can write scripts which can search the compressed tarball and generate more stats.

answered Oct 21 '22 17:10

brianegge

Since you would like to store them to be able to compute misc. statistics over them nightly , export them (ordered by date of arrival or first line content) ... You're expecting 100,000 files a day, at a total of 10,000,000 lines:

I'd suggest:

Store all the files as regular textfiles using the following format : yyyymmdd/producerid/fileno.
At the end of the day, clear the database, and load all the textfiles for the day.
After loading the files, it would be easy to get the stats from the database, and post them in any format needed. (maybe even another "stats" database). You could also generate graphs.
To save space ,you could compress the daily folder. Since they're textfiles, they would compress well.

So you would only be using the database to be able to easily aggregate the data. You could also reproduce the reports for an older day if the process didn't work, by going through the same steps.

answered Oct 21 '22 17:10

Osama Al-Maadeed

Related questions
                            
                                Convert SQL query for a different database
                            
                                MySQL implementation with CUDA
                            
                                How does commit_on_success handle being nested?
                            
                                Web SQL Database or Indexed Database API ... or both?
                            
                                Remove seconds from time field?
                            
                                Storing Friends in Database for Social Network
                            
                                MySQL database connection not closed: what will happen?
                            
                                Symfony2 / Doctrine mapped superclass in the middle of class table inheritance
                            
                                SQLiteOpenHelper - how is the database created?
                            
                                Should I temporarily disable foreign key constraints? How?
                            
                                What's a zip join? Have you ever heard of that, or a pairwise join?
                            
                                Why aggregate functions in PostgreSQL do not work with boolean data type
                            
                                How suitable is opting for RethinkDB instead of traditional SQL for a JSON API? [closed]
                            
                                SetMaxOpenConns and SetMaxIdleConns
                            
                                Connecting to IBM AS400 server for database operations hangs
                            
                                How does Heroku billing dynos work exactly?
                            
                                Data integrity across the databases of different Microservices
                            
                                Flutter sqflite open existing database
                            
                                Version track, automate DB schema changes with django
                            
                                Row Level Security with Entity Framework

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Storage of many log files

Tags:

language-agnostic

database

logging

makapuf

People also ask

4 Answers

kristina

Igor Krivokon

brianegge

Osama Al-Maadeed

Recent Activity

Donate For Us