My problem: I plan to store a "project" in a database, where a project is composed of multiple items e.g. documents, and each document has multiple items e.g. paragraphs. Paragraphs may cross-reference to paragraphs in other documents. Many teams exist, and each team may have many projects. Team members edit, update and refine the position of the cross-referenced documents until they are happy, where the document or project is then reviewed.
When a document is issued, following a review, the issued state is retained and any changes occur to a new staging/current state.
When the project is issued, it is reviewed and issued in its entirety. The document set, content and cross-references are then retained in that issue, so that the state is captured across the project. Any following edits are then applied to a new staging/current version, leaving the issued set available for reading as it was at the point of issue.
I have considered using a typical database setup, but am worried that a) there would quickly be a huge growth in the number of rows stored; and b) that finding the "current" set, or the set for any specific project would be too complex/fragile for reliable use.
As a variation, would storing the documents in e.g. JSON in a single document row stop the propagation of fragmented paragraphs, without incurring a large performance hit?
An alternative thought has been to assign a Git repository per-project; but then worry about storage and performance, and having to store "documents" as e.g. JSON documents or similar for the system to work efficiently. You also then end up managing staging areas per-session for each user logged in, which would be slow - right? Except GitHub, GitLab etc allow fast access to repo history...
The database (or equivalent) will be initially accessed via a web application. Eventually, it may be served to local clients over an API, or even allow local clients to work with the git repos, if that's the path that is followed. Ideally, technologies readily accessible to PHP or NodeJS would be used.
Specific question - how would / should multiple related artefacts requiring configuration control be stored and accessed?
Configuration management can be used to maintain OS configuration files. Example systems include Ansible, Bcfg2, CFEngine, Chef, Nix, Otter, Puppet, Quattor, SaltStack, Terraform, Pulumi and Vagrant. Many of these systems utilize Infrastructure as Code to define and maintain configuration.
The model you expose here is a very clear use case of a relational database. There is no significant performance problem in storing text in a database, and if really you manage to hit database io bottlenecks, server resources are extremely cheap these days, even though any bottlenecks would almost certainly be in your code rather than in database operations.
I'll first expose thoughts about why I believe event sourcing and nosql is not a good fit, then ramble about a possible way to modelize the domain you speak of, and finish with a few example of how any action could be conducted.
I first thought about event sourcing, because of its ease in managing the evolution of... events... such as editing a paragraph, reviewing an issue, etc. etc.
While trying to modelize that however, it appears that this is not practical, because to mitigate having to replay the whole event chain every time you consult an issue, you'd have to save a snapshot in a database / cache system. Supporting both the timed snapshots of the issues and event sourcing does not feel like a net win.
I believe nosql databases are actually a clear loss here. Everything you speak about is totally, absolutely, relational. The structure of the domain does not seem to be ever changing, so you don't gain from a nosql flexibility.
Hence, I think that a traditional relational sql database setup should totally be preferred here.
I have first hand experience in handling relatively large amount of texts (very much like documents having multiple arbitrarily long paragraphs) and haven't have any problem with multiplicating rows. They do multiply, especially if you keep track of history, but that's what relational databases are for, handling numerous rows of data.
I think most of the entities mentioned here can be modelized in a straightforward way. By straightforward, I mean having a few standard fields:
- id - parent_id (when the entity is a child, such as paragraph) - time_created: timestamp - time_updated: timestamp some dbms offer automatic updating of timestamp fields on update of the row
Based on that, this could be a paragraph structure. Note that the status is here, it could very well be in the paragraph_revision
table, for instance to support reviewing further revisions of a paragraph after it was issued for a first time. Or even possibly both, as we can see a use for that later.
paragraph - paragraph_id - document_id - created_at - status (to_be_reviewed, in_review, reviewed, rejected, varia) - order (maybe, if a specific order within a document is desired)
Careful here however, because there is a slightly tricky thing about the situation. Since we want to have a detailed history of all changes, and their authors, the actual content of the paragraph is in an other table, linked to the original (single) paragraph entity. Note that there is only a created_at
timestamp value. In this paradigm, paragraph revisions are never updated, new ones are created. A database trigger or other trick can be used to enfore that.
paragraph_revision - paragraph_id - created_at - text - title (if needed) - revision_number possibly a simple integer based revision number can be used - author_id
The paragraph references can warrant another many-to-many relationship table. To get all of one paragraph references, you select referenced_paragraph_id from paragraph_reference where paragraph_id = :id
or select all paragraph that reference a specific paragraph by switching the columns. Here, you could either reference a general paragraph, or a specific revision of a paragraph, so as not to lose history.
paragraph_reference - paragraph_id - referenced_paragraph_id
An issue seems to be a project, one or multiple documents, each composed of one or multiple paragraphs. Basically, an issue
table, a issue_document
table, linked to a single issue, and a issue_paragraph
table, containing specific paragraph_revision
ids, linked to these issue_document
table. The documents table could be removed, since all paragraph are children of a document, but I prefer being able to select things directly rather then selecting them from their children.
issue - issue_id - timestamp issue_document - issue_id - document_id issue_paragraph - document_id - paragraph_content_id
This could be a valid case of using uuids for entities instead of numerical ids, especially if growth makes it necessary to duplicate databases or be able to create a valid entity before sending it to the database.
While not overly complex, this is a somewhat fancy schema. Foreign keys are not an option. Every parent id must have a foreign key, the database integrity must be enforced by the database engine, or this will become fragile. With foreign keys, a somewhat complex system such as this can and will stay coherent over time.
Hopefully the few examples I gave of how to create tables can give an idea of the general structure, now to some concrete example of how it can be practical for your different use cases.
Fairly straightforward: add a row in the project
table, no need for any other action just yet.
Two rows need to be added: one that will uniquely identify the paragraph in paragraph
, one containing its content in paragraph_revision
.
A single row is added in the paragraph_revision
table. If people are saving each 5 seconds, possibly a single entry can be used and updated until user makes a "I'm ok with this revision" action. (making this the only exception to the "don't update revisions" rule. Other tricks could be used, such as a temp_paragraph_revision
table.)
Selecting the highest numbered revision of a paragraph_revision
would let you see the current version to be reviewed. A list of all revisions can be made by selecting all contents for a specific paragraph_id
. Upon setting the review, the status is updated in the paragraph
and paragraph_revision
table, possibly preventing any revision to be made, or resetting the review status if a revision is added.
Create a row in the issue table, link documents to that issue in the issue_document table, link paragraph_revision
s to these documents in the issue_paragraph
table. There is a choice that can be made here, that is to include the actual content of the revision in this issue_paragraph
table, to be extra certain that a paragraph content can't be changed after an issue, but if the rule of never update a revision is used, that shouldn't be necessary.
While this may seem like a lot of tables to create beforehand, most of them are rather small, and a good UML diagram would clear most hesitation on those. Joins are used in this, make sure to keep indexes on the columns that are used in these joins. All of this can be implement in most sql distributions, MariaDB if you're familiar with it and MySQL, or postgres if you know it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With