I have a table representing values of source file metrics across project revisions, like the following:
Revision FileA FileB FileC FileD FileE ...
1 45 3 12 123 124
2 45 3 12 123 124
3 45 3 12 123 124
4 48 3 12 123 124
5 48 3 12 123 124
6 48 3 12 123 124
7 48 15 12 123 124
(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)
There can be up to 23750 files in 10000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that
Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.
"Sparse" data, in which relations have many attributes that are null for most tuples, presents a challenge for an RDBMS. The traditional RDBMS approach to physical storage assumes that rows of tables have few null values, and that most tables will have tens of columns, and never more than a few hundred columns.
Sparse data is data that has mostly unused elements (elements that don't carry any information ). Sparse Data: is a data set where most of the item values are zero. Dense Array: is the opposite of a sparse array: most of the values are not zero.
Sparse columns are ordinary columns that have an optimized storage for null values. Sparse columns reduce the space requirements for null values at the cost of more overhead to retrieve non-NULL values. Consider using sparse columns when the space saved is at least 20 percent to 40 percent.
This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.
CREATE TABLE Revision_Files
(
start_revision_number INT NOT NULL,
end_revision_number INT NOT NULL,
file_number INT NOT NULL,
value INT NOT NULL,
CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)
GO
To get all of the values for files of a particular revision you could use the following query. Joining to the files table with an outer join would let you get those that have no defined value for that revision.
SELECT
REV.revision_number,
RF.file_number,
RF.value
FROM
Revisions REV
INNER JOIN Revision_Files RF ON
RF.start_revision_number <= REV.revision_number AND
RF.end_revision_number >= REV.revision_number
GO
Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value:
SELECT
REV.revision_number
FROM
Revision_Files RF
INNER JOIN Revisions REV ON
REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
RF.file_number = @file_number AND
RF.value = @value
GO
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With