I am thinking about creating an open source data management web application for various types of data.
A privileged user must be able to
These will be common tasks for the privileged user. He will do this through the web interface of the application. In the end, all data must be searchable and sortable by all types of users of the application. Two questions trouble me:
a) How should the data be stored in the database? Should I dynamically add/remove database tables and/or columns during runtime?
I am no database expert. I am stuck with the imagination that in terms of relational databases, the application has to be able to dynamically add/remove tables (entities) and/or columns (properties) at runtime. And I don't like this idea. Likewise, I am thinking if such dynamic data should be handled in a NoSQL database.
Anyway, I believe that this kind of problem has an intelligent canonical solution, which I just did not find and think of so far. What is the best approach for this kind of dynamic data management?
b) How to implement this in Python using an ORM or NoSQL?
If you recommend using a relational database model, then I would like to use SQLAlchemy. However, I don't see how to dynamically create tables/columns with an ORM at runtime. This is one of the reasons why I hope that there is a much better approach than creating tables and columns during runtime. Is the recommended database model efficiently implementable with SQLAlchemy?
If you recommend using a NoSQL database, which one? I like using Redis -- can you imagine an efficient implementation based on Redis?
Thanks for your suggestions!
Edit in response to some comments:
The idea is that all instances ("rows") of a certain entity ("table") share the same set of properties/attributes ("columns"). However, it will be perfectly valid if certain instances have an empty value for certain properties/attributes.
Basically, users will search the data through a simple form on a website. They query for e.g. all instances of an entity E with property P having a value V higher than T. The result can be sorted by the value of any property.
The datasets won't become too large. Hence, I think even the stupidest approach would still lead to a working system. However, I am an enthusiast and I'd like to apply modern and appropriate technology as well as I'd like to be aware of theoretical bottlenecks. I want to use this project in order to gather experience in designing a "Pythonic", state-of-the-art, scalable, and reliable web application.
I see that the first comments tend to recommending a NoSQL approach. Although I really like Redis, it looks like it would be stupid not to take advantage of the Document/Collection model of Mongo/Couch. I've been looking into mongodb and mongoengine for Python. By doing so, do I take steps into the right direction?
Edit 2 in response to some answers/comments:
From most of your answers, I conclude that the dynamic creation/deletion of tables and columns in the relational picture is not the way to go. This already is valuable information. Also, one opinion is that the whole idea of the dynamic modification of entities and properties could be bad design.
As exactly this dynamic nature should be the main purpose/feature of the application, I don't give up on this. From the theoretical point of view, I accept that performing operations on a dynamic data model must necessarily be slower than performing operations on a static data model. This is totally fine.
Expressed in an abstract way, the application needs to manage
I am looking for an intelligent and efficient way to implement this. From your answers, it looks like NoSQL is the way to go here, which is another important conclusion.
Popular dynamic schema databases include HarperDB and MongoDB.
A dynamic database management system (dynamic DBMS) is a database in which objects have a value-based relationship, which is specified at retrieval time. In a dynamic DBMS, the locations of logical file databases and relational data based are value-based.
NoSQL databases are purpose built for specific data models and have flexible schemas for building modern applications. NoSQL databases are widely recognized for their ease of development, functionality, and performance at scale.
Key Features of MongoDB MongoDB is a document type of database that uses JSON format for the data store. The following are some key features of MongoDB. Dynamic Schema: MongoDB supports dynamic schemas. In other words, we need not define the schema before the insertion of data.
The SQL or NoSQL choice is not your problem. You need to read little more about database design in general. As you said, you're not a database expert(and you don't need to be), but you absolutely must study a little more the RDBMS paradigm.
It's a common mistake for amateur enthusiasts to choose a NoSQL solution. Sometimes NoSQL is a good solution, most of the times is not.
Take for example MongoDB, which you mentioned(and is one of the good NoSQL solutions I've tried). Schema-less, right? Err.. not exactly. You see when something is schema-less means no constraints, validation, etc. But your application's models/entities can't stand on thin air! Surely there will be some constraints and validation logic which you will implement on your software layer. So I give you mongokit! I will just quote from the project's description this tiny bit
MongoKit brings structured schema and validation layer on top of the great pymongo driver
Hmmm... unstructured became structured.
At least we don't have SQL right? Yeah, we don't. We have a different query language which is of course inferior to SQL. At least you don't need to resort to map/reduce for basic queries(see CouchDB).
Don't get me wrong, NoSQL(and especially MongoDB) has its purpose, but most of the times these technologies are used for the wrong reason.
Also, if you care about serious persistence and data integrity forget about NoSQL solutions. All these technologies are too experimental to keep your serious data. By researching a bit who(except Google/Amazon) uses NoSQL solutions and for what exactly, you will find that almost no one uses it for keeping their important data. They mostly use them for logging, messages and real time data. Basically anything to off-load some burden from their SQL db storage.
Redis, in my opinion, is probably the only project who is going to survive the NoSQL explosion unscathed. Maybe because it doesn't advertise itself as NoSQL, but as a key-value store, which is exactly what it is and a pretty damn good one! Also they seem serious about persistence. It is a swiss army knife, but not a good solution to replace entirely your RDBMS.
I am sorry, I said too much :)
So here is my suggestion:
1) Study the RDBMS model a bit.
2) Django is a good framework if most of your project is going to use an RDBMS.
3) Postgresql rocks! Also keep in mind that version 9.2 will bring native JSON support. You could dump all your 'dynamic' properties in there and you could use a secondary storage/engine to perform queries(map/reduce) on said properties. Have your cake and eat it too!
4) For serious search capabilities consider specialized engines like solr.
EDIT: 6 April 2013
5) django-ext-hstore gives you access to postgresql hstore type. It's similar
to a python dictionary and you can perform queries on it, with the limitation that
you can't have nested dictionaries as values. Also the value of key can be only
of type string
.
Have fun
Update in response to OP's comment
0) Consider the application 'contains data' and has already been used for a while
I am not sure if you mean that it contains data in a legacy dbms or you are just trying to say that "imagine that the DB is not empty and consider the following points...". In the former case, it seems a migration issue(completely different question), in the latter, well OK.
1) Admin deletes entity "family" and all related data
Why should someone eliminate completely an entity(table)? Either your application has to do with families, houses, etc or it doesn't. Deleting instances(rows) of families is understandable of course.
2) Admin creates entity "house"
Same with #1. If you introduce a brand new entity in your app then most probably it will encapsulate semantics and business logic, for which new code must be written. This happens to all applications as they evolve through time and of course warrants a creation of a new table, or maybe ALTERing an existing one. But this process is not a part of the functionality of your application. i.e. it happens rarely, and is a migration/refactoring issue.
3) Admin adds properties "floors", "age", ..
Why? Don't we know beforehand that a House
has floors? That a User
has a gender?
Adding and removing, dynamically, this type of attributes is not a feature, but a design flaw. It is part of the analysis/design phase to identify your entities and their respective properties.
4) Privileged user adds some houses.
Yes, he is adding an instance(row) to the existing entity(table) House
.
5) User searches for all houses with at least five floors cheaper than 100 $
A perfectly valid query which can be achieved with either SQL or NoSQL solution. In django it would be something along those lines:
House.objects.filter(floors__gte=5, price__lt=100)
provided that House
has the attributes floors
and price
. But if you need to do text-based queries, then neither SQL nor NoSQL will be satisfying enough. Because you don't want to implement faceting or stemming on your own! You will use some of the already discussed solutions(Solr, ElasticSearch, etc).
Some more general notes:
The examples you gave about Houses
, Users
and their properties, do not warrant any dynamic schema. Maybe you simplified your example just to make your point, but you talk about adding/removing Entities(tables)
as if they are rows in a db. Entities are supposed to be a big deal in an application. They define the purpose of your application and its functionality. As such, they can't change every minute.
Also you said:
The idea is that all instances ("rows") of a certain entity ("table") share the same set of properties/attributes ("columns"). However, it will be perfectly valid if certain instances have an empty value for certain properties/attributes.
This seems like a common case where an attribute has null=True
.
And as a final note, I would like to suggest to you to try both approaches(SQL and NoSQL), since it doesn't seem like your career depends on this project. It will be a beneficiary experience, as you will understand first-hand, the cons and pros of each approach. Or even how to "blend" these approaches together.
What you're asking about is a common requirement in many systems -- how to extend a core data model to handle user-defined data. That's a popular requirement for packaged software (where it is typically handled one way) and open-source software (where it is handled another way).
The earlier advice to learn more about RDBMS design generally can't hurt. What I will add to that is, don't fall into the trap of re-implementing a relational database in your own application-specific data model! I have seen this done many times, usually in packaged software. Not wanting to expose the core data model (or permission to alter it) to end users, the developer creates a generic data structure and an app interface that allows the end user to define entities, fields etc. but not using the RDBMS facilities. That's usually a mistake because it's hard to be nearly as thorough or bug-free as what a seasoned RDBMS can just do for you, and it can take a lot of time. It's tempting but IMHO not a good idea.
Assuming the data model changes are global (shared by all users once admin has made them), the way I would approach this problem would be to create an app interface to sit between the admin user and the RDBMS, and apply whatever rules you need to apply to the data model changes, but then pass the final changes to the RDBMS. So for example, you may have rules that say entity names need to follow a certain format, new entities are allowed to have foreign keys to existing tables but must always use the DELETE CASCADE rule, fields can only be of certain data types, all fields must have default values etc. You could have a very simple screen asking the user to provide entity name, field names & defaults etc. and then generate the SQL code (inclusive of all your rules) to make these changes to your database.
Some common rules & how you would address them would be things like:
-- if a field is not null and has a default value, and there are already existing records in the table before that field was added by the admin, update existing records to have the default value while creating the field (multiple steps -- add field allowing null; update all existing records; alter the table to enforce not null w/ default) -- otherwise you wouldn't be able to use a field-level integrity rule)
-- new tables must have a distinct naming pattern so you can continue to distinguish your core data model from the user-extended data model, i.e. core and user-defined have different RDBMS owners (dbo. vs. user.) or prefixes (none for core, __ for user-defined) or somesuch.
-- it is OK to add fields to tables that are in the core data model (as long as they tolerate nulls or have a default), and it is OK for admin to delete fields that admin added to core data model tables, but admin cannot delete fields that were defined as part of the core data model.
In other words -- use the power of the RDBMS to define the tables and manage the data, but in order to ensure whatever conventions or rules you need will always be applied, do this by building an app-to-DB admin function, instead of giving the admin user direct DB access.
If you really wanted to do this via the DB layer only, you could probably achieve the same by creating a bunch of stored procedures and triggers that would implement the same logic (and who knows, maybe you would do that anyway for your app). That's probably more of a question of how comfortable are your admin users working in the DB tier vs. via an intermediary app.
So to answer your questions directly:
(1) Yes, add tables and columns at run time, but think about the rules you will need to have to ensure your app can work even once user-defined data is added, and choose a way to enforce those rules (via app or via DB / stored procs or whatever) when you process the table & field changes.
(2) This issue isn't strongly affected by your choice of SQL vs. NoSQL engine. In every case, you have a core data model and an extended data model. If you can design your app to respond to a dynamic data model (e.g. add new fields to screens when fields are added to a DB table or whatever) then your app will respond nicely to changes in both the core and user-defined data model. That's an interesting challenge but not much affected by choice of DB implementation style.
Good luck!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With