
MySQL - flexible, Excel-like structure

Tags: sql, mysql, excel

I have recently inherited an already-started project, and I have one challenge right now. One of the requirements is to allow a user to create a "database" inside the application that can have a variable number of user-defined columns (an Excel-like structure).

Here's the sqlfiddle for my current structure.

Here's a query I am using to fetch rows:

select      `row`, 
            group_concat(dd.value order by field(`col`, 1, 2, 3) asc) as `values`
from        db_record dr,
            db_dictionary dd
where       dr.database_id in (1, 2, 3)
and         dr.database_dictionary_id = dd.id
group by    `row`
order by    group_concat(dd.value order by field(`col`, 1, 2, 3) asc);

Ability to sort by any column is achieved by using group_concat().

I am rethinking that design because I have some doubts about performance and about meeting the requirements:

  • It has to be sortable (by any column), meaning that user sorts asc by column 2, and rows are ordered properly.
  • It has to be searchable/filterable. User can filter by values in any column, and only rows containing search phrase should be returned.

The first requirement, I think, is handled by the query pasted above. For the second one I also tried adding a HAVING clause with LIKE to the query, but it compared against the whole GROUP_CONCAT() result rather than a single column.
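For illustration, the attempted filter probably looked something like the sketch below ('foo' stands in for the user's search phrase). Because `values` concatenates every column of the row, the LIKE matches a phrase appearing in any column, not in one chosen column:

select      `row`,
            group_concat(dd.value order by field(`col`, 1, 2, 3) asc) as `values`
from        db_record dr,
            db_dictionary dd
where       dr.database_id in (1, 2, 3)
and         dr.database_dictionary_id = dd.id
group by    `row`
-- the LIKE below runs against the concatenation of ALL columns,
-- so it cannot be restricted to a single column:
having      `values` like '%foo%';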

Can someone advise whether the current DB structure is OK for the purpose, and help me with the latter requirement? Or maybe there's a better approach to the problem?

Last question: is it possible to return values for each column in one query? In the DB, records look like this:

-------------------------------------------
| database_id | dictionary_id | row | col |
-------------------------------------------
| 1           | 1             | 1   | 1   |
-------------------------------------------
| 2           | 2             | 1   | 2   |
-------------------------------------------
| 3           | 3             | 1   | 3   |
-------------------------------------------

And I would like to get a query result grouped by row, similar to this (the column 1 .. 3 values are dictionary_id values):

----------------------------------------
| row | column 1 | column 2 | column 3 | 
----------------------------------------
| 1   | 1        | 2        | 3        |
----------------------------------------

Is that achievable in MySQL? Or is the only solution to use GROUP_CONCAT() and then split the result into columns in PHP?

I need a flexible and efficient structure, and I hope someone can advise me on that. I would really appreciate any help or suggestions.

asked Mar 02 '14 by mkrowiarz

3 Answers

Excel-2-MySQL

A Flexible, Dynamic Adaption of Excel Format to a MySQL Relational Schema

This approach may work for other relational database systems as well, since it relies on no MySQL-specific features beyond SQL-compliant DDL and DML commands. Maintenance of the database can be handled through a combination of internal constraints and stored-procedure APIs, or externally by an alternate scripting language and user interface. This walk-through focuses on the purpose of the schema design, the organization of the data and supporting values, and potential points of expansion for additional enhancements.

Schema Overview and Design Concepts to Adapt a Spreadsheet

The schema leverages the assumption that each data point on the spreadsheet grid can be represented by a unique combination of keys. The simplest combination would be a row-column coordinate pair, such as "A1" (Column A, Row Number 1) or "G72" (Column G, Row Number 72).

This walk-through demonstration will show how to adapt the following data sample in spreadsheet form into a reusable, multi-user relational database format.

[Image: Sample Spreadsheet Data for Excel Simulation]

A pair of coordinates can also include a uniquely assigned spreadsheet/mini-db ID value. For a multi-user environment, the same schema can still be used by adding a supporting user ID value to associate with each spreadsheet ID.

Defining the Smallest Schema Unit: The Vector

After bundling together all the identifying meta info about each data point, the collection is now tagged with a single, globally unique ID, which to some may now appear like a catalog of "vectors".

A VECTOR by mathematical definition is a collection of multiple components and their values used to simplify solutions for problems which exist in spaces that are described through multiple (n) dimensions.

The solution is scalable: mini-databases can be as small as 2 rows x 2 columns or hundreds to thousands of rows and columns wide.

Search, Sort and Pivot Easily

Search queries can be built from the data values of vectors that share common attributes, such as:

  1. Database/Spreadsheet ID and Owner (Example, 10045, Owner = 'HELEN')
  2. Same Column: (Example, Column "A")

Your data set would then be all vector IDs, and their associated data values, that share these common values (a sample search of this kind follows below). Pivot outputs could be accomplished generically with some simple matrix-algebra transformations... a spreadsheet grid is only two dimensions, so it can't be that hard!
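As a minimal sketch of such a search (this uses the single-table DB_VECTOR schema defined below, which has no separate spreadsheet-ID column, so user_id stands in for the owner):

-- All vector IDs and their values in Column A (col_id = 1)
-- belonging to one owner, in row order
SELECT vid, row_id, string_data
FROM   DB_VECTOR
WHERE  user_id = 'RICHARD'
AND    col_id = 1
ORDER BY row_id;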

Handling Different Data Types: Some Design Considerations

The simple approach: store all the data as VARCHAR, but keep track of the original data type so that when you query a vector's data value you can apply the right conversion function. Just be consistent, and use your API or input process to vigilantly police the population of your data store... the last thing you want to end up debugging is a numeric conversion function that has encountered a STRING-typed character.
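A minimal sketch of that convert-on-read idea (the table db_cell and column raw_value are hypothetical names for illustration; the one-table DDL below uses separate typed columns instead):

-- Hypothetical all-VARCHAR variant: every value lives in raw_value,
-- and data_type tells the reader which conversion to apply
SELECT CAST(raw_value AS DECIMAL(20, 4)) AS numeric_value
FROM   db_cell
WHERE  data_type = 'NUMERIC';

SELECT CAST(raw_value AS DATETIME) AS date_value
FROM   db_cell
WHERE  data_type = 'DATETIME';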

The next section contains the DDL code to set up a one-table solution which uses multiple columns to manage the different possible data types that may be hosted within a given spreadsheet grid.

A Single Table Solution for Serving a Spreadsheet Grid Through MySQL

Below is the DDL worked out on MySQL 5.5.32.

-- First Design Idea... Using a Single Table Solution.

CREATE TABLE DB_VECTOR 
    (
      vid int auto_increment primary key,
      user_id varchar(40),
      row_id int,
      col_id int,
      data_type varchar(10), 
      string_data varchar(500),
      numeric_data int,
      date_data datetime
    );

-- Populate Column A with CITY values

INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
  string_data, numeric_data, date_data)
VALUES ('RICHARD', 2, 1, 'STRING', 'ATLANTA', NULL, NULL),
       ('RICHARD', 3, 1, 'STRING', 'MACON', NULL, NULL),
       ('RICHARD', 4, 1, 'STRING', 'SAVANNAH', NULL, NULL),
       ('RICHARD', 5, 1, 'STRING', 'FORT BENNING', NULL, NULL),
       ('RICHARD', 6, 1, 'STRING', 'ATHENS', NULL, NULL);

-- Populate Column B with POPULATION values

INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
  string_data, numeric_data, date_data)
VALUES ('RICHARD', 2, 2, 'NUMERIC', NULL, 1500000, NULL),
       ('RICHARD', 3, 2, 'NUMERIC', NULL, 522000, NULL),
       ('RICHARD', 4, 2, 'NUMERIC', NULL, 275200, NULL),
       ('RICHARD', 5, 2, 'NUMERIC', NULL, 45000, NULL),
       ('RICHARD', 6, 2, 'NUMERIC', NULL, 1325700, NULL);

There is a temptation to run off and start over-normalizing this table, but redundancy may not be that bad. Separate off information that relates to the spreadsheets themselves (such as OWNER/USER name and other demographic info), but otherwise keep things together until you understand the purpose of the vector-based design and some of the performance trade-offs.

One such trade-off with an over-normalized schema is that the required data values end up scattered across multiple tables, and filter criteria may then have to be applied across all the tables involved in the joins. Ironic as it may seem, I have observed that flattened, singular table structures fare well when it comes to querying and reporting, despite some apparent redundancy.

An additional note: creating tables for supporting data linked to the main data source via foreign-key relations is a different story... an implied relation exists between the tables, but many RDBMS systems actually self-optimize based on foreign-key connections.

For example: searching the USER_OWNER column across several million records gets a potential boost if it is linked by a FK to a supporting table that identifies a finite list of, say, 20 users. This is a matter of CARDINALITY, which helps the database build execution plans that can take shortcuts through an otherwise unknown data set.
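A sketch of that idea against the DB_VECTOR table above (the supporting table DB_USER and its columns are hypothetical names for illustration):

-- Hypothetical supporting table: a finite list of known users
CREATE TABLE DB_USER
    (
      user_id varchar(40) primary key,
      full_name varchar(100)
    );

-- The FK declares the low-cardinality relationship to the optimizer
ALTER TABLE DB_VECTOR
    ADD CONSTRAINT fk_vector_user
    FOREIGN KEY (user_id) REFERENCES DB_USER (user_id);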

Getting Your Data Back Out: Some Sample Queries

The first is a base query to pull the data back out in an organized, grid-like format... just like the original Excel page.

-- Pivot the column-1 and column-2 vectors back into a CITY/POPULATION
-- grid, then filter and sort the assembled result
SELECT base_query.CITY, base_query.POPULATION
FROM (
    SELECT CASE WHEN col_a.data_type = 'STRING'   THEN col_a.string_data
                WHEN col_a.data_type = 'NUMERIC'  THEN col_a.numeric_data
                WHEN col_a.data_type = 'DATETIME' THEN col_a.date_data
                ELSE NULL END AS CITY,
           CASE WHEN col_b.data_type = 'STRING'   THEN col_b.string_data
                WHEN col_b.data_type = 'NUMERIC'  THEN col_b.numeric_data
                WHEN col_b.data_type = 'DATETIME' THEN col_b.date_data
                ELSE NULL END AS POPULATION
    FROM  db_vector col_a, db_vector col_b
    WHERE ( col_a.col_id = 1 AND col_b.col_id = 2 )
      AND ( col_a.row_id = col_b.row_id )
) base_query
WHERE base_query.POPULATION >= 500000
ORDER BY base_query.POPULATION DESC;

Even this base query is still a little too specific to serve as a scalable, generic solution for a spreadsheet of arbitrary width or length. But you can see how the inner query in this example remains untouched while the complete data set is quickly filtered or sorted in different ways; only the outer query changes, as in the variation below.
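For instance, reusing the identical inner query but changing only the outer filter and sort (the 'A%' pattern is an arbitrary placeholder; with the sample data it matches ATLANTA and ATHENS):

SELECT base_query.CITY, base_query.POPULATION
FROM (
    SELECT CASE WHEN col_a.data_type = 'STRING'   THEN col_a.string_data
                WHEN col_a.data_type = 'NUMERIC'  THEN col_a.numeric_data
                WHEN col_a.data_type = 'DATETIME' THEN col_a.date_data
                ELSE NULL END AS CITY,
           CASE WHEN col_b.data_type = 'STRING'   THEN col_b.string_data
                WHEN col_b.data_type = 'NUMERIC'  THEN col_b.numeric_data
                WHEN col_b.data_type = 'DATETIME' THEN col_b.date_data
                ELSE NULL END AS POPULATION
    FROM  db_vector col_a, db_vector col_b
    WHERE ( col_a.col_id = 1 AND col_b.col_id = 2 )
      AND ( col_a.row_id = col_b.row_id )
) base_query
-- only these two lines differ from the base query above:
WHERE base_query.CITY LIKE 'A%'
ORDER BY base_query.CITY ASC;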

Some Parting Thoughts: (a.k.a. Some Optional Homework)

  1. It is possible to solve this with a flexible, multi-table solution. I was able to accomplish it in THREE tables (a sketch of this layout appears after this list).

    DB_VECTOR (as you have already seen) underwent some modifications: the data values were moved out, leaving behind strictly positional information (row and column IDs) plus a globally unique spreadsheet ID.

    DB_DATA was used as the final home for the raw data fields: STRING_DATA, NUMERIC_DATA, and DATE_DATA... each record uniquely identified by a VID (vector id).

In the multi-table solution, I used the unique VID instead as a pointer with multiple associated dimensions (owner, sheet id, row, column, etc.) to point to its corresponding data value.

An example of the utility of this design: the possibility of a "look-up" function or query that identifies a collection of vector ids and the data they point to based on the properties of the data itself, or the vector components (row, column, sheet id, etc.)... or a combination.

The payoff is that instead of circulating a whole lot of data (the spreadsheet itself) between the different parts of the code handling this schema, queries deal only with the specific properties and just push around lists (arrays?) or sets of universally unique IDs which point to the data as it is needed.

  2. Initializing New Spreadsheets: If you pursue the multi-table design, your DB_VECTOR table becomes a hollow collection of bins with pointers to the actual data. Before you populate the raw data values, the VECTOR_ID (vid) will need to exist first so you can link the two values.

  3. Which Way is UP???: Using numeric values for row and column IDs seemed like the easy way at first, but I noticed that: (a) I was easily mixing up columns and rows... and worse, not noticing it until it was too late; (b) Excel actually has a convention: rows are numeric, columns are alphabetic (A through ZZ+?). Will users miss the convention or get lost when using our schema? Are there any problems with adopting a non-numeric identification scheme for our data vectors?

  4. Yet Another Dimension: Excel spreadsheets have MULTIPLE sheets. How would support for this convention change the design of your VECTORS? Engineers and scientists even push this limit beyond the three dimensions humans can see. How would that change things? If you tried it, did you find that it imposed a limitation, or did it matter at all?

  5. Stumbled Into This One...: My current DB_VECTOR table contains an extra VARCHAR value called "DETAILS". I found it a useful catch-bin for a miscellaneous, custom attribute that can be unique all the way down to the lowest (VECTOR ID/POINTER) level... or you can use it to create a custom label for an unusual collection of vectors that may not have an easily definable relation (like Excel's "Range Name" property)... What would you use it for?
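
For reference, a hedged sketch of what that three-table split might look like (the answer names only DB_VECTOR and DB_DATA; the third table and all column names here are illustrative guesses):

-- Spreadsheet-level info, separated off as suggested earlier
CREATE TABLE DB_SHEET
    (
      sheet_id int auto_increment primary key,
      user_id varchar(40)
    );

-- Strictly positional information; vid is the globally unique pointer
CREATE TABLE DB_VECTOR
    (
      vid int auto_increment primary key,
      sheet_id int,
      row_id int,
      col_id int
    );

-- The raw data fields, one record per vector ID
CREATE TABLE DB_DATA
    (
      vid int primary key,
      data_type varchar(10),
      string_data varchar(500),
      numeric_data int,
      date_data datetime
    );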

If you're still with me... thanks. This one was a challenging thought exercise in database design. I have purposely left out fully expanded discussions on optimization and performance considerations for the sake of clarity... perhaps something to consider at a later time.

Best Wishes on Your Project.

answered Sep 28 '22 by Richard Pascual


Why not model tabular storage as a table? Just build the ALTER|CREATE|DROP TABLE statements ad hoc, and you can reap all the benefits of actually having a database server. Indexes and SQL come to mind.

Example schema:

CREATE TABLE Worksheets
(
    WorksheetID int auto_increment primary key,
    WorkbookID int not null,
    Name varchar(256) not null,
    TableName nvarchar(256) not null
);

CREATE TABLE Columns
(
    ColumnID int auto_increment primary key,
    WorksheetID int not null,
    ColumnSequenceNo int not null,
    Name varchar(256) not null,
    PerceivedDatatype enum ('string', 'number') not null
);

-- Example of a dynamically generated data table:
-- Note: The number in the column name would correspond to 
-- ColumnSequenceNo in the Columns table
CREATE TABLE `data_e293c71b-b894-4652-a833-ba817339809e`
(
    RowID int auto_increment primary key,
    RowSequenceNo int not null,
    Column1String varchar(256) null,
    Column1Numeric double null,
    Column2String varchar(256) null,
    Column2Numeric double null,
    Column3String varchar(256) null,
    Column3Numeric double null,
    -- ...
    ColumnNString varchar(256) null,
    ColumnNNumeric double null
);

INSERT INTO Worksheets (WorkbookID, Name, TableName)
VALUES (1, 'Countries', 'data_e293c71b-b894-4652-a833-ba817339809e');

SET @worksheetID = LAST_INSERT_ID();

INSERT INTO Columns (WorksheetID, ColumnSequenceNo, Name, PerceivedDatatype)
VALUES (@worksheetID, 1, 'Country Name', 'string'),
       (@worksheetID, 2, 'Population', 'number'),
       (@worksheetID, 3, 'GDP/person', 'number');

-- example of an insert for a new row:
-- if the new data violates any perceived types, update them first
INSERT INTO `data_e293c71b-b894-4652-a833-ba817339809e` (
    RowSequenceNo,
    Column1String,
    Column2String, Column2Numeric,
    Column3String, Column3Numeric)
VALUES (
    1,
    'United States of America',
    '3000000', 3000000,
    '34500', 34500);

-- example of a query on the first column:
select * 
from `data_e293c71b-b894-4652-a833-ba817339809e`
where Column1String like 'United%';

-- example of a query on a column with a numeric perceived datatype:
select * 
from `data_e293c71b-b894-4652-a833-ba817339809e`
where Column3Numeric between 4000 and 40000;
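
Since this answer builds the DDL ad hoc, here is a hedged sketch of what adding a user-defined fourth column might look like (the column name 'Area' and the index are illustrative, not part of the original answer; @worksheetID is the session variable set above):

-- register the new column's metadata
INSERT INTO Columns (WorksheetID, ColumnSequenceNo, Name, PerceivedDatatype)
VALUES (@worksheetID, 4, 'Area', 'number');

-- grow the dynamically generated data table to match
ALTER TABLE `data_e293c71b-b894-4652-a833-ba817339809e`
    ADD COLUMN Column4String varchar(256) null,
    ADD COLUMN Column4Numeric double null;

-- and, since this is a real table, index a frequently filtered column
CREATE INDEX idx_column3_numeric
    ON `data_e293c71b-b894-4652-a833-ba817339809e` (Column3Numeric);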

The moral of the story is that you shouldn't fight the database server; use it to your advantage.

answered Sep 28 '22 by Mitch


This conditional GROUP_CONCAT() pivot produces one result column per spreadsheet column, grouped by row, so the HAVING clause can filter on a single column (the database_id value and the '%biu%' phrase are placeholders):

select      `row`,
            group_concat(if(`col` = 1, dd.value, null)) as column1,
            group_concat(if(`col` = 2, dd.value, null)) as column2,
            group_concat(if(`col` = 3, dd.value, null)) as column3
from        db_record dr
left join   db_dictionary dd on (dr.database_dictionary_id = dd.id)
where       dr.database_id = 1
group by    `row`
-- filters a single pivoted column, instead of the whole
-- concatenated row as in the question's attempt
having      column1 like '%biu%'
order by    `row` asc;
answered Sep 28 '22 by user8795462