Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which database should I use to store records, and how should I use it?

I'm developing an application that will store a sizeable number of records. These records will be something like (URL, date, title, source, {optional data...})

As this is a client-side app, I don't want to use a database server, I just want the info stored into files.

I want the files to be readable from various languages (at least python and C++), so something language specific like python's pickle is out of the game.

I am seeing two possibilities: sqlite and BerkeleyDB. As my use case is clearly not relational, I am tempted to go with BerkeleyDB, however I don't really know how I should use it to store my records, as it only stores key/value pairs.

Is my reasoning correct? If so, how should I use BDB to store my records? Can you link me to relevant info? Or am I missing a better solution?

like image 699
static_rtti Avatar asked Nov 08 '09 17:11

static_rtti


People also ask

Which database is best for storing data?

SQL Server SQL Server was developed by Microsoft and it's considered a great RDBMS for both on-premise and cloud environments. It has a Database Engine component that allows storing, processing, and securing data. The database engine is divided into two segments — the relational and storage engine.

What kind of database is used to store documents?

A document database (also known as a document-oriented database or a document store) is a database that stores information in documents. Document databases offer a variety of advantages, including: An intuitive data model that is fast and easy for developers to work with.

Which database is best for millions of records?

Oracle Database Oracle has provided high-quality database solutions since the 1970s. The most recent version of Oracle Database was designed to integrate with cloud-based systems, and it allows you to manage massive databases with billions of records.


3 Answers

I am seeing two possibilities: sqlite and BerkeleyDB. As my use case is clearly not relational, I am tempted to go with BerkeleyDB, however I don't really know how I should use it to store my records, as it only stores key/value pairs.

What you are describing is exactly what relational is about, even if you only need one table. SQLite will probably make this very easy to do.

EDIT: The relational model doesn't have anything to do with relationships between tables. A relation is a subset of the Cartesian product of other sets. For instance, the cartesian product of the Real numbers, Real Numbers, and Real numbers (Yes, all three the same) produce 3d coordinate space, and you could define a relation upon that space with a formula, say x*y = z. each possible set of coordinates (x0,y0,z0) are either in the relation if they satisfy the given formula, or else they are not.

A relational database uses this concept with a few additional requirements. First, and most important, the size of the relation must be finite. The product relation given above doesn't satisfy that requirement, because there are infinitely many 3-tuples that satisfy the formula. There are a number of other considerations that have more to do with what is practical or useful on real computers solving real problems.

A better way of thinking about the problem is to think about where each type of persistence mechanism specifically works better than the other. You already recognize that a relational solution makes sense when you have many separate datasets (tables) that must support relationships between them (foreign key constraints), which is almost impossible to enforce with a key-value store. Another real advantage to relational is the way it makes rich, ad-hoc queries possible with the use of proper indexes. This is a consequence of the database layer actually understanding the data that it is representing.

A key-value store has it's own set of advantages. One of the more important is the way that key-value stores scale out. It is no consequence that memcached, couchdb, hadoop all use key-value storage, because it is easy to distribute key-value lookup across multiple servers. Another area that key-value storage works well is when the key or value is opaque, such as when the stored item is encrypted, only to be readable by it's owner.


To drive this point home, that a Relational database works well even when you just don't need more than one table, consider the following (not original)

SELECT t1.actor1 
FROM workswith AS t1, 
     workswith AS t2, 
     workswith AS t3, 
     workswith AS t4, 
     workswith AS t5,
     workswith AS t6
WHERE t1.actor2 = t2.actor1 AND
      t2.actor2 = t3.actor1 AND
      t3.actor2 = t4.actor1 AND
      t4.actor2 = t5.actor1 AND
      t5.actor2 = t6.actor1 AND
      t6.actor2 = "Kevin Bacon";

Which, obviously uses a single table: workswith to compute every actor with a bacon number of 6

like image 95
SingleNegationElimination Avatar answered Sep 30 '22 04:09

SingleNegationElimination


BerkeleyDB is good, also look at the *DBM incarnations (e.g. GDBM). The big question though is: for what do you need to search? Do you need to search by that URL, by a range of URLs or the dates you list?

It is also quite possible to keep groups of records as simple files in the local filesystem, grouped by dates or search terms, &c.

Answering the "search" question is the biggest start.

As for the key/value thingy, what you need to ensure is that the KEY itself is well defined as for your lookups. If for example you need to lookup by dates sometimes and others by title, you will need to maintain a "record" row, and then possibly 2 or more "index" rows making reference to the original record. You can model nearly anything in a key/value store.

like image 45
Jé Queue Avatar answered Sep 30 '22 03:09

Jé Queue


Personally I would use sqlite anyway. It has always just worked for me (and for others I work with). When your app grows and you suddenly do want to do something a little more sophisticated, you won't have to rewrite.

On the other hand, I've seen various comments on the Python dev list about Berkely DB that suggest it's less than wonderful; you only get dict-style access (what if you want to select certain date ranges or titles instead of URLs); and it's not even in Python 3's standard set of libraries.

like image 20
andrew cooke Avatar answered Sep 30 '22 05:09

andrew cooke