Library/data structure for handling huge data

I have some huge binary driver logs (around 2-5 GB each, and probably around 10x as large after converting them to a readable form), and I need to write a tool that lets me sequentially browse, sort, search, and filter them efficiently (in order to find and resolve bugs).

Each log entry has a few attributes: timestamp, type, message, and some GUIDs. Entries are homogeneous, there are no relations, and there is no need to store the data after "inspecting" it.

I don't really know how to handle so much data. Keeping everything in memory would be foolish, and the same goes for keeping it all in a flat file. I thought of using a small DBMS like SQLite, but I'm not sure it would be fast enough, and I don't need most DBMS features, only sorting and searching. I would gladly trade space for speed in this case, if that's possible.

Is there any library (or maybe data structure) that would help me handle such amounts of data?

"Served" RDBMSs like Postgre, MSSQL, MySQL are out of the question, the tool should be easy to use anywhere without any hassle.

EDIT: Oh, and does anyone know whether SQLite's ":memory:" mode has any restrictions on database size, or will it just keep growing until virtual memory is exhausted?
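
For reference, this is roughly the kind of minimal schema I have in mind if I go the SQLite route (the table and column names are just illustrative, and passing ":memory:" instead of a filename would keep the whole database in RAM):

    #include <sqlite3.h>
    #include <cstdio>

    int main()
    {
        sqlite3* db = nullptr;

        // A file-backed database; ":memory:" instead of a filename keeps it all in RAM.
        if (sqlite3_open("driverlog.db", &db) != SQLITE_OK)
        {
            std::fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        // One flat table per log, plus indexes on the columns used for sorting/filtering.
        const char* schema =
            "CREATE TABLE entries ("
            "  ts      INTEGER,"   /* timestamp */
            "  type    INTEGER,"
            "  guid    TEXT,"
            "  message TEXT);"
            "CREATE INDEX idx_entries_ts   ON entries(ts);"
            "CREATE INDEX idx_entries_type ON entries(type);";

        char* err = nullptr;
        if (sqlite3_exec(db, schema, nullptr, nullptr, &err) != SQLITE_OK)
        {
            std::fprintf(stderr, "schema failed: %s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }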

asked Aug 09 '10 by kurczak
1 Answer

Check out STXXL - Standard Template Library for Extra Large Data Sets.

"The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance."

Also, if you can dedicate several computers to the task, check out Hadoop, especially HBase, Hive, and MapReduce.

answered by Lior Kogan