How should I store extremely large amounts of traffic data for easy retrieval?

For a traffic accounting system, I need to store large amounts of data about Internet packets sent through our gateway router (containing timestamp, user id, destination or source IP, number of bytes, etc.).

This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.

What is a good way to do this? I already have some ideas:

  • Create a file for each user and day and append every dataset to it.

    • Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
    • Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
  • Use a database

    • Advantage: It's very easy to find specific data with the right SQL query.
    • Disadvantage: I'm not sure whether there is a database engine that can efficiently handle a table with possibly hundreds of millions of rows.
  • Perhaps it's possible to combine the two approaches: use an SQLite database file for each user (a rough sketch of what I mean follows this list).

    • Advantage: It would be easy to get information for one user using SQL queries on his file.
    • Disadvantage: Getting overall information would still be difficult.
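
For concreteness, the per-user SQLite idea might look roughly like this; the table and column names are just made up for illustration:

    -- Hypothetical schema inside one per-user file, e.g. traffic_user42.db
    CREATE TABLE packets (
        ts       INTEGER NOT NULL,  -- Unix timestamp
        src_ip   TEXT    NOT NULL,
        dst_ip   TEXT    NOT NULL,
        protocol TEXT    NOT NULL,  -- 'TCP', 'UDP', ...
        bytes    INTEGER NOT NULL
    );
    CREATE INDEX idx_packets_ts ON packets (ts);

    -- Per-user questions stay simple:
    SELECT protocol, SUM(bytes) AS total_bytes
    FROM packets
    WHERE ts BETWEEN 1267142400 AND 1267228800   -- one day
    GROUP BY protocol;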

But perhaps someone else has a very good idea?

Thanks very much in advance.

asked Feb 26 '10 by Christoph Wurm

1 Answer

First, get The Data Warehouse Toolkit before you do anything.

You're doing a data warehousing job, so you need to tackle it like one. You'll need to read up on the proper design patterns for this kind of thing.

[Note: "data warehouse" does not mean crazy big, expensive, or complex. It means a star schema and smart ways to handle high volumes of data that are never updated.]

  1. SQL databases are slow, but that slowness is what buys you flexible retrieval.

  2. The filesystem is fast. It's terrible for updating, but you're not updating; you're just accumulating.

A typical DW approach to this looks like the following.

  1. Define the "Star Schema" for your data. The measurable facts and the attributes ("dimensions") of those facts. Your fact appear to be # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.

  2. Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.). Each dimension will have all the attributes you might ever want to know. This grows over time; people are always adding attributes to dimensions.

  3. Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.

  4. Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.
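
As a sketch of steps 1 and 2, the star schema could look something like this; every table and column name here is hypothetical, chosen only to match the fact and dimensions mentioned above:

    -- Dimension tables (the master dimension database)
    CREATE TABLE dim_user (
        user_key    INTEGER PRIMARY KEY,
        user_id     TEXT NOT NULL,
        department  TEXT              -- attributes accumulate over time
    );

    CREATE TABLE dim_ip (
        ip_key      INTEGER PRIMARY KEY,
        ip_address  TEXT NOT NULL,
        is_internal INTEGER           -- 0/1 flag
    );

    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, -- e.g. 20100226
        day      INTEGER,
        month    INTEGER,
        year     INTEGER,
        weekday  TEXT
    );

    -- Fact table: one row per packet (or per aggregated flow).
    -- The measure is the byte count; everything else is a dimension key.
    CREATE TABLE fact_traffic (
        date_key INTEGER NOT NULL REFERENCES dim_date (date_key),
        user_key INTEGER NOT NULL REFERENCES dim_user (user_key),
        src_key  INTEGER NOT NULL REFERENCES dim_ip (ip_key),
        dst_key  INTEGER NOT NULL REFERENCES dim_ip (ip_key),
        ts       INTEGER NOT NULL,    -- full timestamp for fine-grained queries
        bytes    INTEGER NOT NULL
    );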
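
And a sketch of steps 3 and 4, assuming the lookups are done in SQL; the staging table, file paths, and SQLite shell commands are assumptions, not part of the answer:

    -- Raw gateway logs bulk-loaded into a staging table (SQLite shell):
    --   .mode csv
    --   .import /var/log/traffic/2010-02-26.csv stage_traffic
    CREATE TABLE stage_traffic (
        ts      INTEGER,
        user_id TEXT,
        src_ip  TEXT,
        dst_ip  TEXT,
        bytes   INTEGER
    );

    -- Add users seen for the first time (new IP addresses are handled the same way).
    INSERT INTO dim_user (user_id)
    SELECT DISTINCT s.user_id
    FROM stage_traffic s
    WHERE NOT EXISTS (SELECT 1 FROM dim_user d WHERE d.user_id = s.user_id);

    -- Resolve the dimension keys and write the finished fact rows out
    -- as an accumulating CSV load file:
    --   .mode csv
    --   .once /data/warehouse/facts/2010-02-26.csv
    SELECT CAST(strftime('%Y%m%d', s.ts, 'unixepoch') AS INTEGER) AS date_key,
           du.user_key,
           si.ip_key AS src_key,
           di.ip_key AS dst_key,
           s.ts,
           s.bytes
    FROM stage_traffic s
    JOIN dim_user du ON du.user_id    = s.user_id
    JOIN dim_ip   si ON si.ip_address = s.src_ip
    JOIN dim_ip   di ON di.ip_address = s.dst_ip;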

When someone wants to do analysis, build them a datamart.

For the selected IP addresses or time frame or whatever, get all the relevant facts plus the associated master dimension data and bulk-load them into a datamart.
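
For example (hypothetical paths and names, continuing the sketch above): copy the dimension tables into a fresh mart database and bulk-load only the fact files you care about:

    -- Inside a new, empty datamart database:
    ATTACH DATABASE '/data/warehouse/dimensions.db' AS dims;
    CREATE TABLE dim_user AS SELECT * FROM dims.dim_user;
    CREATE TABLE dim_ip   AS SELECT * FROM dims.dim_ip;
    CREATE TABLE dim_date AS SELECT * FROM dims.dim_date;

    CREATE TABLE fact_traffic (
        date_key INTEGER, user_key INTEGER,
        src_key  INTEGER, dst_key  INTEGER,
        ts       INTEGER, bytes    INTEGER
    );
    -- Bulk-load only the days under analysis (SQLite shell):
    --   .mode csv
    --   .import /data/warehouse/facts/2010-02-26.csv fact_traffic
    --   .import /data/warehouse/facts/2010-02-27.csv fact_traffic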

You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SUM() over the byte measure, with various GROUP BY, HAVING and WHERE clauses.
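
For instance, total traffic per user and day in the mart (again using the hypothetical names from the sketches above):

    SELECT d.year, d.month, d.day, u.user_id, SUM(f.bytes) AS total_bytes
    FROM fact_traffic f
    JOIN dim_user u ON u.user_key = f.user_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.month, d.day, u.user_id
    HAVING SUM(f.bytes) > 1000000
    ORDER BY total_bytes DESC;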

answered Nov 15 '22 by S.Lott