Which database should I use to store records, and how should I use it?

Tags: c++, python, database, persistence

I'm developing an application that will store a sizeable number of records. These records will be something like (URL, date, title, source, {optional data...}).

As this is a client-side app, I don't want to use a database server; I just want the info stored in files.

I want the files to be readable from various languages (at least Python and C++), so something language-specific like Python's pickle is out of the question.

I am seeing two possibilities: SQLite and BerkeleyDB. As my use case is clearly not relational, I am tempted to go with BerkeleyDB. However, I don't really know how I should use it to store my records, as it only stores key/value pairs.

Is my reasoning correct? If so, how should I use BDB to store my records? Can you link me to relevant info? Or am I missing a better solution?


asked Nov 08 '09 17:11

static_rtti

3 Answers

I am seeing two possibilities: sqlite and BerkeleyDB. As my use case is clearly not relational, I am tempted to go with BerkeleyDB, however I don't really know how I should use it to store my records, as it only stores key/value pairs.

What you are describing is exactly what relational is about, even if you only need one table. SQLite will probably make this very easy to do.
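As a rough illustration of the single-table approach, here is a minimal sketch using Python's standard sqlite3 module. The table name, column names, and sample values are assumptions made up for this example (they are not from the question), and storing the optional data as JSON text is just one possible choice.

```python
import json
import sqlite3

# In-memory database for the sketch; a real app would pass a file path instead.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table holding (URL, date, title, source, optional data).
conn.execute(
    """
    CREATE TABLE records (
        url    TEXT PRIMARY KEY,
        date   TEXT NOT NULL,   -- ISO 8601 string, e.g. '2009-11-08'
        title  TEXT NOT NULL,
        source TEXT NOT NULL,
        extra  TEXT             -- optional data, stored here as JSON text
    )
    """
)

# Insert one made-up record using parameter placeholders.
record = (
    "http://example.com/a",
    "2009-11-08",
    "Example title",
    "feed-1",
    json.dumps({"lang": "en"}),
)
conn.execute("INSERT INTO records VALUES (?, ?, ?, ?, ?)", record)
conn.commit()

# Look the record up by its URL.
row = conn.execute(
    "SELECT title, date, extra FROM records WHERE url = ?",
    ("http://example.com/a",),
).fetchone()
title, date, extra = row
print(title, date, json.loads(extra))
conn.close()
```

The resulting file is an ordinary SQLite database, so the same table can be opened from C++ through the SQLite C API.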

EDIT: The relational model doesn't have anything to do with relationships between tables. A relation is a subset of the Cartesian product of other sets. For instance, the Cartesian product of the real numbers with themselves three times over (yes, all three the same set) produces 3D coordinate space, and you could define a relation upon that space with a formula, say x*y = z. Each possible triple of coordinates (x0, y0, z0) is either in the relation, if it satisfies the formula, or it is not.

A relational database uses this concept with a few additional requirements. First, and most important, the size of the relation must be finite. The product relation given above doesn't satisfy that requirement, because there are infinitely many 3-tuples that satisfy the formula. There are a number of other considerations that have more to do with what is practical or useful on real computers solving real problems.
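To make the idea concrete, here is a small sketch of a finite relation in Python. The domain 0..3 is an arbitrary stand-in for the real numbers, chosen only so that the relation is finite and can be listed in full.

```python
from itertools import product

# Finite stand-in for the real numbers (an assumption for this sketch).
domain = range(0, 4)

# Subset of the Cartesian product domain x domain x domain
# containing exactly the triples that satisfy x*y = z.
relation = {(x, y, z) for x, y, z in product(domain, repeat=3) if x * y == z}

# Any triple is either in the relation or it is not.
print((1, 2, 2) in relation)
print((2, 2, 3) in relation)
print(len(relation))
```

Each triple in the set corresponds to what a database would call a row, and the three positions correspond to columns.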

A better way of thinking about the problem is to think about where each type of persistence mechanism specifically works better than the other. You already recognize that a relational solution makes sense when you have many separate datasets (tables) that must support relationships between them (foreign key constraints), which is almost impossible to enforce with a key-value store. Another real advantage to relational is the way it makes rich, ad-hoc queries possible with the use of proper indexes. This is a consequence of the database layer actually understanding the data that it is representing.

A key-value store has its own set of advantages. One of the more important is the way that key-value stores scale out. It is no coincidence that memcached, CouchDB, and Hadoop all use key-value storage, because it is easy to distribute key-value lookup across multiple servers. Another area where key-value storage works well is when the key or value is opaque, such as when the stored item is encrypted and readable only by its owner.

To drive home the point that a relational database works well even when you don't need more than one table, consider the following (not original):

SELECT t1.actor1
FROM workswith AS t1,
     workswith AS t2,
     workswith AS t3,
     workswith AS t4,
     workswith AS t5,
     workswith AS t6
WHERE t1.actor2 = t2.actor1 AND
      t2.actor2 = t3.actor1 AND
      t3.actor2 = t4.actor1 AND
      t4.actor2 = t5.actor1 AND
      t5.actor2 = t6.actor1 AND
      t6.actor2 = 'Kevin Bacon';

This uses a single table, workswith, joined to itself six times, to find every actor linked to Kevin Bacon through a chain of six workswith rows (loosely, a Bacon number of 6).
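The same self-join pattern can be tried on toy data. The sketch below is a scaled-down version with two aliases instead of six, run through Python's sqlite3 module; the actor names "Alice" and "Bob" and the two rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workswith (actor1 TEXT, actor2 TEXT)")

# Made-up data: Alice worked with Bob, and Bob worked with Kevin Bacon.
conn.executemany(
    "INSERT INTO workswith VALUES (?, ?)",
    [("Alice", "Bob"), ("Bob", "Kevin Bacon")],
)

# Join the single table to itself: t1 is the first link, t2 the second.
rows = conn.execute(
    """
    SELECT t1.actor1
    FROM workswith AS t1,
         workswith AS t2
    WHERE t1.actor2 = t2.actor1 AND
          t2.actor2 = 'Kevin Bacon'
    """
).fetchall()
print(rows)
conn.close()
```

Adding further aliases lengthens the chain by one link each, which is all the six-alias query does.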

answered Sep 30 '22 04:09

SingleNegationElimination

BerkeleyDB is good; also look at the *DBM incarnations (e.g. GDBM). The big question though is: what do you need to search by? Do you need to search by that URL, by a range of URLs, or by the dates you list?

It is also quite possible to keep groups of records as simple files in the local filesystem, grouped by dates or search terms, etc.

Answering the "search" question is the biggest start.

As for the key/value thingy, what you need to ensure is that the KEY itself is well defined for your lookups. If, for example, you need to look up by date sometimes and by title at other times, you will need to maintain a "record" row, and then possibly 2 or more "index" rows that refer back to the original record. You can model nearly anything in a key/value store.
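The "record" row plus "index" rows layout could look something like the sketch below. It uses Python's standard dbm module as a stand-in key/value store; the key prefixes ("rec:", "title:", "date:"), the helper names, and the JSON encoding are all assumptions for illustration. Note that the on-disk format of Python's dbm depends on which backend is available, so for files shared with C++ you would apply the same key layout through a specific library such as Berkeley DB or GDBM.

```python
import dbm
import json
import os
import tempfile


def put_record(db, record):
    """Store one record row plus index rows that point back at it."""
    url = record["url"]
    db["rec:" + url] = json.dumps(record)        # the "record" row
    db["title:" + record["title"]] = url         # index row: title -> URL
    date_key = "date:" + record["date"]          # index row: date -> list of URLs
    urls = json.loads(db[date_key]) if date_key in db else []
    if url not in urls:
        urls.append(url)
    db[date_key] = json.dumps(urls)


def get_by_title(db, title):
    """Follow the title index row to the record row."""
    url = db["title:" + title].decode("utf-8")
    return json.loads(db["rec:" + url])


def get_by_date(db, date):
    """Follow the date index row to every record stored for that date."""
    key = "date:" + date
    urls = json.loads(db[key]) if key in db else []
    return [json.loads(db["rec:" + u]) for u in urls]


with tempfile.TemporaryDirectory() as tmp:
    with dbm.open(os.path.join(tmp, "records"), "c") as db:
        put_record(db, {
            "url": "http://example.com/a",
            "date": "2009-11-08",
            "title": "Example title",
            "source": "feed-1",
        })
        by_title = get_by_title(db, "Example title")
        by_date = get_by_date(db, "2009-11-08")

print(by_title["url"], len(by_date))
```

The cost of this design is that every write has to keep the index rows in sync with the record row by hand, and this sketch assumes titles are unique.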

answered Sep 30 '22 03:09

Jé Queue

Personally I would use sqlite anyway. It has always just worked for me (and for others I work with). When your app grows and you suddenly do want to do something a little more sophisticated, you won't have to rewrite.

On the other hand, I've seen various comments on the Python dev list about Berkeley DB that suggest it's less than wonderful. You only get dict-style access (what if you want to select by date range or title instead of by URL?), and it's not even in Python 3's standard library.
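For comparison, here is a sketch of the kind of date-range and title selection that is awkward with dict-style access but direct in SQLite. The schema, index name, and sample rows are invented for this example, and dates are assumed to be stored as ISO 8601 strings so that text comparison matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (url TEXT PRIMARY KEY, date TEXT, title TEXT, source TEXT)"
)
# An index on date lets range queries avoid scanning the whole table.
conn.execute("CREATE INDEX idx_records_date ON records (date)")

conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?)",
    [
        ("http://example.com/a", "2009-10-15", "Intro to SQLite", "feed-1"),
        ("http://example.com/b", "2009-11-08", "Berkeley DB notes", "feed-2"),
        ("http://example.com/c", "2009-11-20", "More SQLite tips", "feed-1"),
        ("http://example.com/d", "2009-12-01", "Unrelated post", "feed-3"),
    ],
)

# Select by date range.
in_range = conn.execute(
    "SELECT url FROM records WHERE date BETWEEN ? AND ? ORDER BY date",
    ("2009-11-01", "2009-11-30"),
).fetchall()

# Select by a substring of the title.
titled = conn.execute(
    "SELECT url FROM records WHERE title LIKE ? ORDER BY date",
    ("%SQLite%",),
).fetchall()

print(in_range)
print(titled)
conn.close()
```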

answered Sep 30 '22 05:09

andrew cooke
