Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How should I store multiple, related models when configuration control is required? [closed]

My problem: I plan to store a "project" in a database, where a project is composed of multiple items e.g. documents, and each document has multiple items e.g. paragraphs. Paragraphs may cross-reference to paragraphs in other documents. Many teams exist, and each team may have many projects. Team members edit, update and refine the position of the cross-referenced documents until they are happy, where the document or project is then reviewed.

When a document is issued, following a review, the issued state is retained and any changes occur to a new staging/current state.

When the project is issued, it is reviewed and issued in its entirety. The document set, content and cross-references are then retained in that issue, so that the state is captured across the project. Any following edits are then applied to a new staging/current version, leaving the issued set available for reading as it was at the point of issue.

I have considered using a typical database setup, but am worried that a) there would quickly be a huge growth in the number of rows stored; and b) that finding the "current" set, or the set for any specific project would be too complex/fragile for reliable use.

As a variation, would storing the documents in e.g. JSON in a single document row stop the propagation of fragmented paragraphs, without incurring a large performance hit?

An alternative thought has been to assign a Git repository per-project; but then worry about storage and performance, and having to store "documents" as e.g. JSON documents or similar for the system to work efficiently. You also then end up managing staging areas per-session for each user logged in, which would be slow - right? Except GitHub, GitLab etc allow fast access to repo history...

The database (or equivalent) will be initially accessed via a web application. Eventually, it may be served to local clients over an API, or even allow local clients to work with the git repos, if that's the path that is followed. Ideally, technologies readily accessible to PHP or NodeJS would be used.

Specific question - how would / should multiple related artefacts requiring configuration control be stored and accessed?

like image 938
Kurucu Avatar asked Aug 13 '18 05:08

Kurucu


People also ask

What can be stored under configuration management?

Configuration management can be used to maintain OS configuration files. Example systems include Ansible, Bcfg2, CFEngine, Chef, Nix, Otter, Puppet, Quattor, SaltStack, Terraform, Pulumi and Vagrant. Many of these systems utilize Infrastructure as Code to define and maintain configuration.


1 Answers

TL;DR

The model you expose here is a very clear use case of a relational database. There is no significant performance problem in storing text in a database, and if really you manage to hit database io bottlenecks, server resources are extremely cheap these days, even though any bottlenecks would almost certainly be in your code rather than in database operations.


Outline

I'll first expose thoughts about why I believe event sourcing and nosql is not a good fit, then ramble about a possible way to modelize the domain you speak of, and finish with a few example of how any action could be conducted.

Not Event Sourcing

I first thought about event sourcing, because of its ease in managing the evolution of... events... such as editing a paragraph, reviewing an issue, etc. etc.

While trying to modelize that however, it appears that this is not practical, because to mitigate having to replay the whole event chain every time you consult an issue, you'd have to save a snapshot in a database / cache system. Supporting both the timed snapshots of the issues and event sourcing does not feel like a net win.

Not Nosql Databases

I believe nosql databases are actually a clear loss here. Everything you speak about is totally, absolutely, relational. The structure of the domain does not seem to be ever changing, so you don't gain from a nosql flexibility.

Yes to Relational Databases

Hence, I think that a traditional relational sql database setup should totally be preferred here.

I have first hand experience in handling relatively large amount of texts (very much like documents having multiple arbitrarily long paragraphs) and haven't have any problem with multiplicating rows. They do multiply, especially if you keep track of history, but that's what relational databases are for, handling numerous rows of data.

A Tentative Model of the Database

I think most of the entities mentioned here can be modelized in a straightforward way. By straightforward, I mean having a few standard fields:

- id
- parent_id (when the entity is a child, such as paragraph)
- time_created: timestamp
- time_updated: timestamp 
    some dbms offer automatic updating of timestamp fields on update of the row

Based on that, this could be a paragraph structure. Note that the status is here, it could very well be in the paragraph_revision table, for instance to support reviewing further revisions of a paragraph after it was issued for a first time. Or even possibly both, as we can see a use for that later.

paragraph
- paragraph_id
- document_id
- created_at
- status (to_be_reviewed, in_review, reviewed, rejected, varia)
- order (maybe, if a specific order within a document is desired)

Careful here however, because there is a slightly tricky thing about the situation. Since we want to have a detailed history of all changes, and their authors, the actual content of the paragraph is in an other table, linked to the original (single) paragraph entity. Note that there is only a created_at timestamp value. In this paradigm, paragraph revisions are never updated, new ones are created. A database trigger or other trick can be used to enfore that.

paragraph_revision
- paragraph_id
- created_at
- text
- title (if needed)
- revision_number
    possibly a simple integer based revision number can be used
- author_id

The paragraph references can warrant another many-to-many relationship table. To get all of one paragraph references, you select referenced_paragraph_id from paragraph_reference where paragraph_id = :id or select all paragraph that reference a specific paragraph by switching the columns. Here, you could either reference a general paragraph, or a specific revision of a paragraph, so as not to lose history.

paragraph_reference
- paragraph_id
- referenced_paragraph_id

An issue seems to be a project, one or multiple documents, each composed of one or multiple paragraphs. Basically, an issue table, a issue_document table, linked to a single issue, and a issue_paragraph table, containing specific paragraph_revision ids, linked to these issue_document table. The documents table could be removed, since all paragraph are children of a document, but I prefer being able to select things directly rather then selecting them from their children.

issue
- issue_id
- timestamp

issue_document
- issue_id
- document_id

issue_paragraph
- document_id
- paragraph_content_id

UUIDs Can Be Relevant

This could be a valid case of using uuids for entities instead of numerical ids, especially if growth makes it necessary to duplicate databases or be able to create a valid entity before sending it to the database.

Foreign Keys Are Not Optional

While not overly complex, this is a somewhat fancy schema. Foreign keys are not an option. Every parent id must have a foreign key, the database integrity must be enforced by the database engine, or this will become fragile. With foreign keys, a somewhat complex system such as this can and will stay coherent over time.

Managing Domain Actions

Hopefully the few examples I gave of how to create tables can give an idea of the general structure, now to some concrete example of how it can be practical for your different use cases.

Creating a Project

Fairly straightforward: add a row in the project table, no need for any other action just yet.

Creating a Paragraph

Two rows need to be added: one that will uniquely identify the paragraph in paragraph, one containing its content in paragraph_revision.

Updating a Paragraph

A single row is added in the paragraph_revision table. If people are saving each 5 seconds, possibly a single entry can be used and updated until user makes a "I'm ok with this revision" action. (making this the only exception to the "don't update revisions" rule. Other tricks could be used, such as a temp_paragraph_revision table.)

Reviewing a Paragraph

Selecting the highest numbered revision of a paragraph_revision would let you see the current version to be reviewed. A list of all revisions can be made by selecting all contents for a specific paragraph_id. Upon setting the review, the status is updated in the paragraph and paragraph_revision table, possibly preventing any revision to be made, or resetting the review status if a revision is added.

Issuing a Paragraph or Project

Create a row in the issue table, link documents to that issue in the issue_document table, link paragraph_revisions to these documents in the issue_paragraph table. There is a choice that can be made here, that is to include the actual content of the revision in this issue_paragraph table, to be extra certain that a paragraph content can't be changed after an issue, but if the rule of never update a revision is used, that shouldn't be necessary.

Conclusion

While this may seem like a lot of tables to create beforehand, most of them are rather small, and a good UML diagram would clear most hesitation on those. Joins are used in this, make sure to keep indexes on the columns that are used in these joins. All of this can be implement in most sql distributions, MariaDB if you're familiar with it and MySQL, or postgres if you know it.

like image 56
Félix Adriyel Gagnon-Grenier Avatar answered Oct 09 '22 20:10

Félix Adriyel Gagnon-Grenier