How should I store extremely large amounts of traffic data for easy retrieval?

For a traffic accounting system, I need to store large amounts of data about Internet packets sent through our gateway router (containing timestamp, user id, destination or source IP, number of bytes, etc.).

This data has to be stored for some time, at least a few days. Easy retrieval should be possible as well.

What is a good way to do this? I already have some ideas:

  • Create a file for each user and day and append every dataset to it.

    • Advantage: It's probably very fast, and data is easy to find given a consistent file layout.
    • Disadvantage: It's not easily possible to see e.g. all UDP traffic of all users.
  • Use a database

    • Advantage: It's very easy to find specific data with the right SQL query.
    • Disadvantage: I'm not sure whether there is a database engine that can efficiently handle a table with possibly hundreds of millions of rows.
  • Perhaps it's possible to combine the two approaches: use an SQLite database file for each user (a rough sketch of what I mean follows this list).

    • Advantage: It would be easy to get information for one user using SQL queries on his file.
    • Disadvantage: Getting overall information would still be difficult.
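
For concreteness, the per-user SQLite idea might look roughly like this; the table and column names are just made up for illustration:

    -- Hypothetical schema inside one per-user file, e.g. traffic_user42.db
    CREATE TABLE packets (
        ts       INTEGER NOT NULL,  -- Unix timestamp
        src_ip   TEXT    NOT NULL,
        dst_ip   TEXT    NOT NULL,
        protocol TEXT    NOT NULL,  -- 'TCP', 'UDP', ...
        bytes    INTEGER NOT NULL
    );
    CREATE INDEX idx_packets_ts ON packets (ts);

    -- Per-user questions stay simple:
    SELECT protocol, SUM(bytes) AS total_bytes
    FROM packets
    WHERE ts BETWEEN 1267142400 AND 1267228800   -- one day
    GROUP BY protocol;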

But perhaps someone else has a very good idea?

Thanks very much in advance.

asked Feb 26 '10 by Christoph Wurm

1 Answer

First, get The Data Warehouse Toolkit before you do anything.

You're doing a data warehousing job, so you need to tackle it like one. You'll need to read up on the proper design patterns for this kind of thing.

[Note: "data warehouse" does not mean crazy big, expensive, or complex. It means a star schema and smart ways to handle high volumes of data that are never updated.]

  1. SQL databases are slow, but that slowness is what buys you flexible retrieval.

  2. The filesystem is fast. It's terrible for updating, but you're not updating; you're just accumulating.

A typical DW approach to this looks like the following.

  1. Define the "Star Schema" for your data. The measurable facts and the attributes ("dimensions") of those facts. Your fact appear to be # of bytes. Everything else (address, timestamp, user id, etc.) is a dimension of that fact.

  2. Build the dimensional data in a master dimension database. It's relatively small (IP addresses, users, a date dimension, etc.). Each dimension will have all the attributes you might ever want to know. This grows over time; people are always adding attributes to dimensions.

  3. Create a "load" process that takes your logs, resolves the dimensions (times, addresses, users, etc.) and merges the dimension keys in with the measures (# of bytes). This may update the dimension to add a new user or a new address. Generally, you're reading fact rows, doing lookups and writing fact rows that have all the proper FK's associated with them.

  4. Save these load files on the disk. These files aren't updated. They just accumulate. Use a simple notation, like CSV, so you can easily bulk load them.
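
As a sketch of steps 1 and 2, the star schema could look something like this; every table and column name here is hypothetical, chosen only to match the fact and dimensions mentioned above:

    -- Dimension tables (the master dimension database)
    CREATE TABLE dim_user (
        user_key    INTEGER PRIMARY KEY,
        user_id     TEXT NOT NULL,
        department  TEXT              -- attributes accumulate over time
    );

    CREATE TABLE dim_ip (
        ip_key      INTEGER PRIMARY KEY,
        ip_address  TEXT NOT NULL,
        is_internal INTEGER           -- 0/1 flag
    );

    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY, -- e.g. 20100226
        day      INTEGER,
        month    INTEGER,
        year     INTEGER,
        weekday  TEXT
    );

    -- Fact table: one row per packet (or per aggregated flow).
    -- The measure is the byte count; everything else is a dimension key.
    CREATE TABLE fact_traffic (
        date_key INTEGER NOT NULL REFERENCES dim_date (date_key),
        user_key INTEGER NOT NULL REFERENCES dim_user (user_key),
        src_key  INTEGER NOT NULL REFERENCES dim_ip (ip_key),
        dst_key  INTEGER NOT NULL REFERENCES dim_ip (ip_key),
        ts       INTEGER NOT NULL,    -- full timestamp for fine-grained queries
        bytes    INTEGER NOT NULL
    );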
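
And a sketch of steps 3 and 4, assuming the lookups are done in SQL; the staging table, file paths, and SQLite shell commands are assumptions, not part of the answer:

    -- Raw gateway logs bulk-loaded into a staging table (SQLite shell):
    --   .mode csv
    --   .import /var/log/traffic/2010-02-26.csv stage_traffic
    CREATE TABLE stage_traffic (
        ts      INTEGER,
        user_id TEXT,
        src_ip  TEXT,
        dst_ip  TEXT,
        bytes   INTEGER
    );

    -- Add users seen for the first time (new IP addresses are handled the same way).
    INSERT INTO dim_user (user_id)
    SELECT DISTINCT s.user_id
    FROM stage_traffic s
    WHERE NOT EXISTS (SELECT 1 FROM dim_user d WHERE d.user_id = s.user_id);

    -- Resolve the dimension keys and write the finished fact rows out
    -- as an accumulating CSV load file:
    --   .mode csv
    --   .once /data/warehouse/facts/2010-02-26.csv
    SELECT CAST(strftime('%Y%m%d', s.ts, 'unixepoch') AS INTEGER) AS date_key,
           du.user_key,
           si.ip_key AS src_key,
           di.ip_key AS dst_key,
           s.ts,
           s.bytes
    FROM stage_traffic s
    JOIN dim_user du ON du.user_id    = s.user_id
    JOIN dim_ip   si ON si.ip_address = s.src_ip
    JOIN dim_ip   di ON di.ip_address = s.dst_ip;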

When someone wants to do analysis, build them a datamart.

For the selected IP addresses or time frame or whatever, get all the relevant facts plus the associated master dimension data and bulk-load them into a datamart.
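
For example (hypothetical paths and names, continuing the sketch above): copy the dimension tables into a fresh mart database and bulk-load only the fact files you care about:

    -- Inside a new, empty datamart database:
    ATTACH DATABASE '/data/warehouse/dimensions.db' AS dims;
    CREATE TABLE dim_user AS SELECT * FROM dims.dim_user;
    CREATE TABLE dim_ip   AS SELECT * FROM dims.dim_ip;
    CREATE TABLE dim_date AS SELECT * FROM dims.dim_date;

    CREATE TABLE fact_traffic (
        date_key INTEGER, user_key INTEGER,
        src_key  INTEGER, dst_key  INTEGER,
        ts       INTEGER, bytes    INTEGER
    );
    -- Bulk-load only the days under analysis (SQLite shell):
    --   .mode csv
    --   .import /data/warehouse/facts/2010-02-26.csv fact_traffic
    --   .import /data/warehouse/facts/2010-02-27.csv fact_traffic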

You can do all the SQL queries you want on this mart. Most of the queries will devolve to SELECT COUNT(*) and SUM() over the byte measure, with various GROUP BY, HAVING and WHERE clauses.
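
For instance, total traffic per user and day in the mart (again using the hypothetical names from the sketches above):

    SELECT d.year, d.month, d.day, u.user_id, SUM(f.bytes) AS total_bytes
    FROM fact_traffic f
    JOIN dim_user u ON u.user_key = f.user_key
    JOIN dim_date d ON d.date_key = f.date_key
    GROUP BY d.year, d.month, d.day, u.user_id
    HAVING SUM(f.bytes) > 1000000
    ORDER BY total_bytes DESC;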

answered Nov 15 '22 by S.Lott