I have read numerous posts along the lines of multidimensional to single dimension, multidimensional database, and so on, but none of the answers helped. I did find a lot of documentation on Google but that only provided background information and didn't answer the question at hand. I have a lot of strings that are related to one another. They are needed in a PHP script. The structure is hierarchical. Here is an example. <pre class="prettyprint"><code>A: AA: AAA AAC AB AE: AEA AEE: AEEB B: BA: BAA BD: BDC: BDCB BDCE BDD: BDDA BE: BED: BEDA C: CC: CCB: CCBC CCBE CCC: CCCA CCCE CE </code></pre> Each indent supposes a new level in the multidimensional array. The goal is to retrieve an element with PHP by name and all its descendants. If for instance I query for A, I want to receive an array of string containing <code>array('A', 'AA', 'AAA', 'AAC', 'AB', 'AE', 'AEA', 'AEE', 'AEEB')</code>. The 'issue' is that queries can also be made to lower-level elements. If I query AEE, I want to get <code>array('AEE', 'AEEB')</code>. As I understand the concept of relational databases, this means that I cannot use a relational database because there is no common 'key' between elements. The solution that I thought is possible, is assigning PARENT elements to each cell. So, in a table: <pre class="prettyprint"><code>CELL | PARENT A NULL AA A AAA AA AAC AA AB A AE A AEA AE AEE AE AEEB AEE </code></pre> By doing so, I think you should be able to query the given string, and all items that share this parent, and then recursively go down this path until no more items are found. However, this seems rather slow to me because the whole search space would need to be looked through on each level - which is exactly what you don't want in a multidimensional array. So I am a bit at loss. Note that there are actually around 100,000 strings structured in this way, so speed is important. Luckily the database is static and would not change. How can I store such a data structure in a database without having to deal with long loops and search times? <strike>And which kind of database software and data type is best suited for this?</strike> It has come to my attention that PostgreSQL is already present on our servers so I'd rather stick with that. As I said I am new to databases but I am very eager to learn. Therefore, I am looking for an extensive answer that goes into detail and provides advantages and disadvantages of a certain approach. Performance is key. An expected answer would contain the best database type and language for this use case, and also script in that language to build such a structure.

This approach does not depend on the existence of a path or parent column. It is relational not recursive. Since the table is static create a materialized view containing just the leaves to make searching faster: <pre class="prettyprint"><code>create materialized view leave as select cell from ( select cell, lag(cell,1,cell) over (order by cell desc) not like cell || '%' as leave from t ) s where leave; table leave; cell ------ CCCE CCCA CCBE CCBC BEDA BDDA BDCE BDCB BAA AEEB AEA AB AAC AAA </code></pre> A materialized view is computed once at creation not at each query like a plain view. Create an index to speed it up: <pre class="prettyprint"><code>create index cell_index on leave(cell); </code></pre> If eventually the source table is altered just refresh the view: <pre class="prettyprint"><code>refresh materialized view leave; </code></pre> The search function receives text and returns a text array: <pre class="prettyprint"><code>create or replace function get_descendants(c text) returns text[] as $$ select array_agg(distinct l order by l) from ( select left(cell, generate_series(length(c), length(cell))) as l from leave where cell like c || '%' ) s; $$ language sql immutable strict; </code></pre> Pass the desired match to the function: <pre class="prettyprint"><code>select get_descendants('A'); get_descendants ----------------------------------- {A,AA,AAA,AAC,AB,AE,AEA,AEE,AEEB} select get_descendants('AEE'); get_descendants ----------------- {AEE,AEEB} </code></pre> Test data: <pre class="prettyprint"><code>create table t (cell text); insert into t (cell) values ('A'), ('AA'), ('AAA'), ('AAC'), ('AB'), ('AE'), ('AEA'), ('AEE'), ('AEEB'), ('B'), ('BA'), ('BAA'), ('BD'), ('BDC'), ('BDCB'), ('BDCE'), ('BDD'), ('BDDA'), ('BE'), ('BED'), ('BEDA'), ('C'), ('CC'), ('CCB'), ('CCBC'), ('CCBE'), ('CCC'), ('CCCA'), ('CCCE'), ('CE'); </code></pre>

Store multidimensional array in database: relational or multidimensional?

Tags:

performance

sql

database

multidimensional-array

postgresql

I have read numerous posts along the lines of multidimensional to single dimension, multidimensional database, and so on, but none of the answers helped. I did find a lot of documentation on Google but that only provided background information and didn't answer the question at hand.

I have a lot of strings that are related to one another. They are needed in a PHP script. The structure is hierarchical. Here is an example.

A:
  AA:
    AAA
    AAC
  AB
  AE:
    AEA
    AEE:
      AEEB
B:
  BA:
    BAA
  BD:
    BDC:
      BDCB
      BDCE
    BDD:
      BDDA
  BE:
    BED:
      BEDA
C:
  CC:
    CCB:
      CCBC
      CCBE
    CCC:
      CCCA
      CCCE
  CE

Each indent supposes a new level in the multidimensional array.

The goal is to retrieve an element with PHP by name and all its descendants. If for instance I query for A, I want to receive an array of string containing array('A', 'AA', 'AAA', 'AAC', 'AB', 'AE', 'AEA', 'AEE', 'AEEB'). The 'issue' is that queries can also be made to lower-level elements. If I query AEE, I want to get array('AEE', 'AEEB').

As I understand the concept of relational databases, this means that I cannot use a relational database because there is no common 'key' between elements. The solution that I thought is possible, is assigning PARENT elements to each cell. So, in a table:

CELL | PARENT
A      NULL
AA     A
AAA    AA
AAC    AA
AB     A
AE     A
AEA    AE
AEE    AE
AEEB   AEE

By doing so, I think you should be able to query the given string, and all items that share this parent, and then recursively go down this path until no more items are found. However, this seems rather slow to me because the whole search space would need to be looked through on each level - which is exactly what you don't want in a multidimensional array.

So I am a bit at loss. Note that there are actually around 100,000 strings structured in this way, so speed is important. Luckily the database is static and would not change. How can I store such a data structure in a database without having to deal with long loops and search times? ~~And which kind of database software and data type is best suited for this?~~ It has come to my attention that PostgreSQL is already present on our servers so I'd rather stick with that.

As I said I am new to databases but I am very eager to learn. Therefore, I am looking for an extensive answer that goes into detail and provides advantages and disadvantages of a certain approach. Performance is key. An expected answer would contain the best database type and language for this use case, and also script in that language to build such a structure.

597

asked Sep 23 '16 08:09

Bram Vanroy

5 Answers

The goal is to retrieve an element with PHP by name and all its descendants.

If that is all you need, you can use a LIKE search

SELECT *
FROM Table1
WHERE CELL LIKE 'AEE%';

With an index beginning with CELL this is a range check, which is fast.

If your data doesn't look like that, you can create a path column which looks like a directory path and contains all nodes "on the way/path" from root to the element.

| id | CELL | parent_id | path     |
|====|======|===========|==========|
|  1 | A    |      NULL | 1/       |
|  2 | AA   |         1 | 1/2/     |
|  3 | AAA  |         2 | 1/2/3/   |
|  4 | AAC  |         2 | 1/2/4/   |
|  5 | AB   |         1 | 1/5/     |
|  6 | AE   |         1 | 1/6/     | 
|  7 | AEA  |         6 | 1/6/7/   |
|  8 | AEE  |         6 | 1/6/8/   |
|  9 | AEEB |         8 | 1/6/8/9/ |

To retrieve all descendants of 'AE' (including itself) your query would be

SELECT *
FROM tree t
WHERE path LIKE '1/6/%';

or (MySQL specific concatenation)

SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
  AND t.path LIKE CONCAT(r.path, '%');

Result:

| id | CELL | parent_id |     path |
|====|======|===========|==========|
|  6 | AE   |         1 | 1/6/     |
|  7 | AEA  |         6 | 1/6/7/   |
|  8 | AEE  |         6 | 1/6/8/   |
|  9 | AEEB |         8 | 1/6/8/9/ |

Demo

Performance

I have created 100K rows of fake data on MariaDB with the sequence plugin using the following script:

drop table if exists tree;
CREATE TABLE tree (
  `id` int primary key,
  `CELL` varchar(50),
  `parent_id` int,
  `path` varchar(255),
  unique index (`CELL`),
  unique index (`path`)
);

DROP TRIGGER IF EXISTS `tree_after_insert`;
DELIMITER //
CREATE TRIGGER `tree_after_insert` BEFORE INSERT ON `tree` FOR EACH ROW BEGIN
    if new.id = 1 then
        set new.path := '1/';
    else    
        set new.path := concat((
            select path from tree where id = new.parent_id
        ), new.id, '/');
    end if;
END//
DELIMITER ;

insert into tree
    select seq as id
        , conv(seq, 10, 36) as CELL
        , case 
            when seq = 1 then null
            else floor(rand(1) * (seq-1)) + 1 
        end as parent_id
        , null as path
    from seq_1_to_100000
;
DROP TRIGGER IF EXISTS `tree_after_insert`;
-- runtime ~ 4 sec.

Tests

Count all elements under the root:

SELECT count(*)
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '1'
  AND t.path LIKE CONCAT(r.path, '%');
-- result: 100000
-- runtime: ~ 30 ms

Get subtree elements under a specific node:

SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = '3B0'
  AND t.path LIKE CONCAT(r.path, '%');
-- runtime: ~ 30 ms

Result:

| id    | CELL | parent_id | path                                |
|=======|======|===========|=====================================|
|  4284 | 3B0  |       614 | 1/4/11/14/614/4284/                 |
|  6560 | 528  |      4284 | 1/4/11/14/614/4284/6560/            |
|  8054 | 67Q  |      6560 | 1/4/11/14/614/4284/6560/8054/       |
| 14358 | B2U  |      6560 | 1/4/11/14/614/4284/6560/14358/      |
| 51911 | 141Z |      4284 | 1/4/11/14/614/4284/51911/           |
| 55695 | 16Z3 |      4284 | 1/4/11/14/614/4284/55695/           |
| 80172 | 1PV0 |      8054 | 1/4/11/14/614/4284/6560/8054/80172/ |
| 87101 | 1V7H |     51911 | 1/4/11/14/614/4284/51911/87101/     |

PostgreSQL

This also works for PostgreSQL. Only the string concatenation syntax has to be changed:

SELECT t.*
FROM tree t
CROSS JOIN tree r -- root
WHERE r.CELL = 'AE'
  AND t.path LIKE r.path || '%';

Demo: sqlfiddle - rextester

How does the search work

If you look at the test example, you'll see that all paths in the result begin with '1/4/11/14/614/4284/'. That is the path of the subtree root with CELL='3B0'. If the path column is indexed, the engine will find them all efficiently, because the index is sorted by path. It's like you would want to find all the words that begin with 'pol' in a dictionary with 100K words. You wouldn't need to read the entire dictionary.

143

answered Oct 25 '22 00:10

Paul Spiegel

This approach does not depend on the existence of a path or parent column. It is relational not recursive.

Since the table is static create a materialized view containing just the leaves to make searching faster:

create materialized view leave as
select cell
from (
    select cell,
        lag(cell,1,cell) over (order by cell desc) not like cell || '%' as leave
    from t
) s
where leave;

table leave;
 cell 
------
 CCCE
 CCCA
 CCBE
 CCBC
 BEDA
 BDDA
 BDCE
 BDCB
 BAA
 AEEB
 AEA
 AB
 AAC
 AAA

A materialized view is computed once at creation not at each query like a plain view. Create an index to speed it up:

create index cell_index on leave(cell);

If eventually the source table is altered just refresh the view:

refresh materialized view leave;

The search function receives text and returns a text array:

create or replace function get_descendants(c text)
returns text[] as $$
    select array_agg(distinct l order by l)
    from (
        select left(cell, generate_series(length(c), length(cell))) as l
        from leave
        where cell like c || '%'
    ) s;
$$ language sql immutable strict;

Pass the desired match to the function:

select get_descendants('A');
          get_descendants          
-----------------------------------
 {A,AA,AAA,AAC,AB,AE,AEA,AEE,AEEB}

select get_descendants('AEE');
 get_descendants 
-----------------
 {AEE,AEEB}

Test data:

create table t (cell text);
insert into t (cell) values
('A'),
('AA'),
('AAA'),
('AAC'),
('AB'),
('AE'),
('AEA'),
('AEE'),
('AEEB'),
('B'),
('BA'),
('BAA'),
('BD'),
('BDC'),
('BDCB'),
('BDCE'),
('BDD'),
('BDDA'),
('BE'),
('BED'),
('BEDA'),
('C'),
('CC'),
('CCB'),
('CCBC'),
('CCBE'),
('CCC'),
('CCCA'),
('CCCE'),
('CE');

answered Oct 25 '22 02:10

Clodoaldo Neto

Performance

As others have already mentioned, performance shouldn't be an issue as long as you use a suitable indexed primary key and ensure that relations use foreign keys. In general, an RDBMS is highly optimised to efficiently perform joins on indexed columns and referential integrity can also provide the advantage of preventing orphans. 100,000 may sound a lot of rows but this isn't going to stretch an RDBMS as long as the table structure and queries are well designed.

Choice of RDBMS

One factor in answering this question lies in choosing a database with the ability to perform a recursive query via a Common Table Expression (CTE), which can be very useful to keep the queries compact or essential if there are queries that do not limit the number of descendants being traversed.

Since you've indicated that you are free to choose the RDBMS but it must run under Linux, I'm going to throw PostgreSQL out there as a suggestion since it has this feature and is freely available. (This choice is of course very subjective and there are advantages and disadvantages of each but a few other contenders I'd be tempted to rule out are MySQL since it doesn't currently support CTEs, MariaDB since it doesn't currently support *recursive* CTEs, SQL Server since it doesn't currently support Linux. Other possibilities such as Oracle may be dependent on budget / existing resources.)

SQL

Here's an example of the SQL you'd write to perform your first example of finding all the descendants of 'A':

WITH RECURSIVE rcte AS (
   SELECT id, letters
   FROM cell 
   WHERE letters = 'A'
   UNION ALL
   SELECT c.id, c.letters
   FROM cell c
   INNER JOIN rcte r
   ON c.parent_cell_id = r.id
)
SELECT letters
FROM rcte
ORDER BY letters;

Explanation

The above SQL sets up a "Common Table Expression", i.e. a SELECT to run whenever its alias (in this case rcte) is referenced. The recursion happens because this is referenced within itself. The first part of the UNION picks the cell at the top of the hierarchy. Its descendants are all found by carrying on joining on children in the second part of the UNION until no further records are found.

Demo

The above query can be seen in action on the sample data here: http://rextester.com/HVY63888

answered Oct 25 '22 02:10

Steve Chambers

You absolutely can do that (if I've read your question correctly).

Depending on your RDBMS you might have to choose a different way.

Your basic structure of having a parent is correct.

SQL Server use recursive common table expression (CTE) to anchor the start and work down

https://technet.microsoft.com/en-us/library/ms186243(v=sql.105).aspx

Edit: For Linux use the same in PostgreSQL https://www.postgresql.org/docs/current/static/queries-with.html

Oracle has a different approach, though I think you might be able to use the CTE as well.

https://oracle-base.com/articles/misc/hierarchical-queries

For 100k rows I don't imagine performance will be an issue, though I'd still index PK & FK because that's the right thing to do. If you're really concerned about speed then reading it into memory and building a hash table of linked lists might work.

Pros & cons - it pretty much comes down to readability and suitability for your RDBMS.

It's an already solved problem (again, assuming I've not missed anything) so you'll be fine.

answered Oct 25 '22 02:10

LoztInSpace

I have two words for you... "RANGE KEYS"

You may find this technique to be incredibly powerful and flexible. You'll be able to navigate your hierarchies with ease, and support variable depth aggregation without the need for recursion.

In the demonstration below, we'll build the hierarchy via a recursive CTE. For larger hierarchies 150K+, I'm willing to share a much faster build in needed.

Since your hierarchies are slow moving (like mine), I tend to store them in a normalized structure and rebuild as necessary.

How about some actual code?

Declare @YourTable table (ID varchar(25),Pt varchar(25))
Insert into @YourTable values 
('A'   ,NULL),
('AA'  ,'A'),
('AAA' ,'AA'),
('AAC' ,'AA'),
('AB'  ,'A'),
('AE'  ,'A'),
('AEA' ,'AE'),
('AEE' ,'AE'),
('AEEB','AEE')


Declare @Top  varchar(25) = null     --<<  Sets top of Hier Try 'AEE'
Declare @Nest varchar(25) ='|-----'  --<<  Optional: Added for readability

IF OBJECT_ID('TestHier') IS NOT NULL 
Begin
    Drop Table TestHier
End

;with cteHB as (
      Select Seq  = cast(1000+Row_Number() over (Order by ID) as varchar(500))
            ,ID
            ,Pt
            ,Lvl=1
            ,Title = ID
      From   @YourTable 
      Where  IsNull(@Top,'TOP') = case when @Top is null then isnull(Pt,'TOP') else ID end
      Union  All
      Select cast(concat(cteHB.Seq,'.',1000+Row_Number() over (Order by cteCD.ID)) as varchar(500))
            ,cteCD.ID
            ,cteCD.Pt
            ,cteHB.Lvl+1
            ,cteCD.ID
      From   @YourTable cteCD 
      Join   cteHB on cteCD.Pt = cteHB.ID)
     ,cteR1 as (Select Seq,ID,R1=Row_Number() over (Order By Seq) From cteHB)
     ,cteR2 as (Select A.Seq,A.ID,R2=Max(B.R1) From cteR1 A Join cteR1 B on (B.Seq like A.Seq+'%') Group By A.Seq,A.ID )
Select B.R1  
      ,C.R2
      ,A.ID
      ,A.Pt
      ,A.Lvl
      ,Title = Replicate(@Nest,A.Lvl-1) + A.Title
 Into dbo.TestHier
 From cteHB A
 Join cteR1 B on A.ID=B.ID
 Join cteR2 C on A.ID=C.ID
 Order By B.R1

Show The Entire Hier I added the Title and Nesting for readability

Select * from TestHier Order By R1

enter image description here

Just to state the obvious, the Range Keys are R1 and R2. You may also notice that R1 maintains the presentation sequence. Leaf nodes are where R1=R2 and Parents or rollups define the span of ownership.

To Show All Descendants

Declare @GetChildrenOf varchar(25) = 'AE'
Select A.*
  From TestHier A
  Join TestHier B on B.ID=@GetChildrenOf and A.R1 Between B.R1 and B.R2
  Order By R1

enter image description here

To Show Path

Declare @GetParentsOf varchar(25) = 'AEEB'
Select A.*
  From TestHier A
  Join TestHier B on B.ID=@GetParentsOf and B.R1 Between A.R1 and A.R2
  Order By R1

enter image description here

Clearly these are rather simple illustrations. Over time, I have created a series of helper functions, both Scalar and Table Value Functions. I should also state that you should NEVER hard code range key in your work because they will change.

In Summary

If you have a point (or even a series of points), you'll have its range and therefore you'll immediately know where it resides and what rolls into it.

answered Oct 25 '22 02:10

John Cappelletti

Related questions
                            
                                How do I limit the number of rows returned by this LEFT JOIN to one?
                            
                                Oracle SQL -- Analytic functions OVER a group?
                            
                                file conversion- .mwb to .sql file
                            
                                How to avoid duplicated FETCH in T-SQL when using a cursor?
                            
                                Find records where join doesn't exist
                            
                                Creating View from Another View
                            
                                Insert statement asked me to insert for autoincrement column
                            
                                mysql sort category according to parent id
                            
                                'Script Table as' in SSMS Not Preserving Collation
                            
                                emulating MySQL's substring_index() in PGSQL
                            
                                How to change a dataype CLOB TO VARCHAR2(sql)
                            
                                set decimal places of sum in sql
                            
                                Finding Execution time of query using SQL Developer
                            
                                Postgres constraint for unique datetime range
                            
                                XQuery [nodes()]: Syntax error near '<eof>', expected a step expression
                            
                                sql varchar(max) vs varchar(fix)
                            
                                MySQL Query WHERE through multiple pivot tables
                            
                                Hibernate. PSQLException: bad value for type int : admin
                            
                                How to set to NULL a datetime with 0000-00-00 00:00:00 value?
                            
                                Convert scientific notation back to numbers in SQL Server

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Store multidimensional array in database: relational or multidimensional?

Tags:

performance

sql

database

multidimensional-array

postgresql

Bram Vanroy

People also ask

5 Answers

Performance

Tests

PostgreSQL

How does the search work

Paul Spiegel

Clodoaldo Neto

Performance

Choice of RDBMS

SQL

Demo

Steve Chambers

LoztInSpace

John Cappelletti

Recent Activity

Donate For Us