Sparse data: efficient storage and retrieval in an RDBMS

Tags:

I have a table representing values of source file metrics across project revisions, like the following:

Revision FileA FileB FileC FileD FileE ...
1           45     3    12   123   124
2           45     3    12   123   124
3           45     3    12   123   124
4           48     3    12   123   124
5           48     3    12   123   124
6           48     3    12   123   124
7           48    15    12   123   124

(The relational view of the above data is different. Each row contains the following columns: Revision, FileId, Value. The files and their revisions from which the data is calculated are stored in Subversion repositories, so we're trying to represent the repository's structure in a relational schema.)

There can be up to 23750 files in 10000 revisions (this is the case for the ImageMagick drawing program). As you can see, most values are the same between successive revisions, so the table's useful data is quite sparse. I am looking for a way to store the data that

avoids replication and uses space efficiently (currently the non-sparse representation requires 260 GB (data+index) for less than 10% of the data I want to store)
allows me to retrieve efficiently the values for a specific revision using an SQL query (without explicitly looping through revisions or files)
allows me to retrieve efficiently the revision for a specific metric value.

Ideally, the solution should not depend on a particular RDBMS and should be compatible with Hibernate. If this is not possible, I can live with using Hibernate, MySQL or PostgreSQL-specific features.

272

asked Jan 05 '09 15:01

Diomidis Spinellis

1 Answers

This is how I might model it. I've left out the Revisions table and Files table as those should be pretty self-explanatory.

CREATE TABLE Revision_Files
(
    start_revision_number   INT NOT NULL,
    end_revision_number     INT NOT NULL,
    file_number             INT NOT NULL,
    value                   INT NOT NULL,
    CONSTRAINT PK_Revision_Files PRIMARY KEY CLUSTERED (start_revision_number, file_number),
    CONSTRAINT CHK_Revision_Files_start_before_end CHECK (start_revision_number <= end_revision_number)
)
GO

To get all of the values for files of a particular revision you could use the following query. Joining to the files table with an outer join would let you get those that have no defined value for that revision.

SELECT
    REV.revision_number,
    RF.file_number,
    RF.value
FROM
    Revisions REV
INNER JOIN Revision_Files RF ON
    RF.start_revision_number <= REV.revision_number AND
    RF.end_revision_number >= REV.revision_number
GO

Assuming that I understand correctly what you want in your third point, this will let you get all of the revisions for which a particular file has a certain value:

SELECT
    REV.revision_number
FROM
    Revision_Files RF
INNER JOIN Revisions REV ON
    REV.revision_number BETWEEN RF.start_revision_number AND RF.end_revision_number
WHERE
    RF.file_number = @file_number AND
    RF.value = @value
GO

111

answered Sep 30 '22 18:09

Tom H

Related questions
                            
                                PrintWindow WPF / DirectX
                            
                                Using \verbatim as part of an argument for a macro?
                            
                                Embedded HTML control for Blackberry?
                            
                                How do you block selected applications from accessing the internet (C++, Win32)
                            
                                Data structure or algorithm for second degree lookups in sub-linear time?
                            
                                Best practicies for JUnit and RMI Apps, RMI Registry
                            
                                The "Waiting lists problem"
                            
                                How does Microsoft's Entity Framework inhibit test driven development? [closed]
                            
                                Delphi Components for Face Identification and Tagging
                            
                                TextBox Undo/Redo commands
                            
                                Why is Visual Studios Installer not picking up dependencies
                            
                                What should "value_from_datadict" method of a custom form widget return?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With