
Build tree-like hash (YAML) from a complex database extraction (SQL)?

Tags:

hash

perl

Introduction

Given the tables at the end of this question, I would like an algorithm or a simple solution that builds a nested tree from a YAML description. The YAML format itself is optional: the output I actually need is an array of ordered hashes that may or may not contain nested ordered hashes or arrays of ordered hashes.

In short, I am talking about a tree-like structure.

To make my question easier to understand, I will work through a simple example that covers all my needs. It is the example I am using to implement this algorithm.

I decided to ask this question in parallel with my own investigations because my knowledge of Perl is limited. I don't want to go down the wrong path, which is why I am asking for help.

I am currently focussing on the DBI module. I tried to look at other modules such as DBIx::Tree::NestedSet, but I don't think it is what I need.

So, let's get down to the details of my example.

Example

The initial idea is to write a Perl program that takes a YAML description and outputs the extracted data.

This input description follows simple rules:

  • query describes the data we are looking for. It can contain the following keys:
    • sql is the SQL query.
    • hide hides columns from the final output. It is used when a column is required only by a subquery and is not wanted in the end.
    • subquery is a nested query executed for each row of the parent query.
    • bind names the columns whose values are bound to the placeholders of the SQL query.
    • hash tells the program to group the results into a hash of hashes instead of an array of hashes. This could be passed directly to DBI's selectall_hashref. If this field is omitted, the output is an array of ordered hashes.
    • key is the name under which the subquery result is stored, at the same level as the parent's result columns. We will see later that a key name can mask a result column.
    • list tells the program to return the result as an array. Only one column can be displayed, i.e. list: name displays the list of names.
  • connect is the DBI connection string.
  • format is the output format. It can be XML, YAML or JSON. I primarily focus on YAML as it can be easily translated. When omitted, the default output is YAML.
  • indent is the number of spaces per indentation level. The values tabs and tab are also supported.
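The hash and list options described above could be sketched as two small post-processing steps applied to the rows of a query. This is only an illustration under assumptions: rows_to_hash and rows_to_list are hypothetical helper names, the rows are plain hash references (so key order is not preserved here), and hash is assumed to name a single key column that is dropped from the inner hashes.

```perl
use strict;
use warnings;

# Hypothetical helper for the "hash" option: index rows by one column
# and drop that column from each inner hash.
sub rows_to_hash {
    my ($rows, $key) = @_;
    my %by_key;
    for my $row (@$rows) {
        my %rest = %$row;                 # shallow copy, keep the input intact
        my $name = delete $rest{$key};
        $by_key{$name} = \%rest;
    }
    return \%by_key;
}

# Hypothetical helper for the "list" option: keep one column of every row.
sub rows_to_list {
    my ($rows, $column) = @_;
    return [ map { $_->{$column} } @$rows ];
}

my @drinks = (
    { name => 'water',        calories => 0,  potassium => '3 mg'   },
    { name => 'orange juice', calories => 45, potassium => '200 mg' },
);

my $by_name = rows_to_hash(\@drinks, 'name');
print "$_: calories=$by_name->{$_}{calories}\n" for sort keys %$by_name;
print join(', ', @{ rows_to_list(\@drinks, 'name') }), "\n";
```

If two rows share the same key value, the later row silently replaces the earlier one in this sketch, so the key column should be unique within a result set.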

In addition, Perl hashes are not ordered. Here the order of the output keys matters: they should appear in the same order as the columns in the SQL query.

For now I simply use the YAML module :(
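One way to get both a configurable indent and a fixed key order is to emit the text by hand. The sketch below is a swapped-in technique, not something the YAML module does: omap, scalar_text and emit are hypothetical names, ordered mappings are represented as blessed flat lists of key/value pairs, the indent is assumed to be at least 2, and the quoting rule is deliberately conservative rather than a complete YAML implementation.

```perl
use strict;
use warnings;

# Hypothetical marker for an ordered mapping: a blessed flat list of
# key/value pairs, so the key order is simply the list order.
sub omap { return bless [@_], 'OMap' }

# Quote a scalar unless it is made of obviously safe characters.
sub scalar_text {
    my ($value) = @_;
    return '~' unless defined $value;
    return $value
        if $value =~ /\A[A-Za-z0-9_][A-Za-z0-9_ .\/-]*\z/ && $value !~ /\s\z/;
    (my $quoted = $value) =~ s/'/''/g;
    return "'$quoted'";
}

# Render nested OMap / array data as YAML-style text.
sub emit {
    my ($node, $indent, $depth) = @_;
    my $pad = ' ' x ($indent * $depth);
    my $out = '';
    if (ref($node) eq 'OMap') {
        for (my $i = 0; $i < @$node; $i += 2) {
            my ($key, $value) = @{$node}[$i, $i + 1];
            $out .= ref($value)
                ? "$pad$key:\n" . emit($value, $indent, $depth + 1)
                : "$pad$key: " . scalar_text($value) . "\n";
        }
    }
    elsif (ref($node) eq 'ARRAY') {
        my $dash = '-' . ' ' x ($indent - 1);
        for my $item (@$node) {
            if (ref($item)) {
                my $text = emit($item, $indent, $depth + 1);
                next unless length $text;
                # Overwrite the first line's extra indent with the dash.
                substr($text, length($pad), $indent, $dash);
                $out .= $text;
            }
            else {
                $out .= $pad . $dash . scalar_text($item) . "\n";
            }
        }
    }
    return $out;
}

my $doc = [
    omap(
        nationality => 'Brit',
        house       => omap(paint => 'red', hex => '#CB2821'),
        pet         => ['birds', 'phasmatodea'],
    ),
];
print emit($doc, 4, 0);
```

Before hand-rolling an emitter it may be worth checking the YAML module's documented global settings $YAML::Indent and $YAML::SortKeys, which control the indentation width and whether hash keys are sorted when dumping.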

In summary, in the end we will just execute this command:

$ cat desc.yml | ./fetch > data.yml

The desc.yml description is given below:

---
connect: "dbi:SQLite:dbname=einstein-puzzle.sqlite"
indent: 4
query:
   - sql: SELECT * from people
     hide:
        - pet_id
        - house_id
        - id
     subquery:
        - key: brevage
          bind: id
          sql: |
                SELECT name, calories, potassium FROM drink
                LEFT JOIN people_has_drink ON drink.id = people_has_drink.id_drink
                WHERE people_has_drink.id_people = ?
          hash:
             - name
        - key: house
          sql: SELECT color as paint, size, id from house WHERE id = ?
          hide: id
          bind: house_id
          subquery:
             - key: color
               sql: SELECT color AS name, ral, hex from color WHERE short LIKE ?
               bind: paint
        - key: pet
          sql: SELECT name from pet WHERE id = ?
          bind: pet_id
          list: name

Expected Output

From the description above, the output data would be this:

---
- nationality: Norvegian
  smoke: Dunhill
  brevage:
      orange juice:
          calories: 45
          potassium: 200 mg
      water:
          calories: 0
          potassium: 3 mg
  house:
      color:
          name: Zinc yellow
          ral:  RAL 1018
          hex:  '#F8F32B'
      paint: yellow
      size: small
  pet:
      - cats
- nationality: Brit
  smoke: Pall Mall
  brevage:
      milk:
          calories: 42
          potassium: 150 mg
  house:
      color:
          name: Vermilion
          ral:  RAL 2002
          hex:  '#CB2821'
      paint: red
      size: big
  pet:
      - birds
      - phasmatodea

Where I am

I have not fully implemented the nested queries yet. My current state is given here:

#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;
use DBI;
use YAML;
use Data::Dumper;
use Tie::IxHash;

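# Read the configuration and connect to the database
my %yml = %{ Load(do { local $/; <DATA>}) };
# RaiseError makes DBI die on errors instead of only warning
my $dbh = DBI->connect($yml{connect}, undef, undef, { RaiseError => 1 });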

# Fill the bind values of the first query with command-line information
my %bind;
for(@ARGV) {
    next unless /--(\w+)=(.*)/;
    $bind{$1} = $2;
}

my $q0 = $yml{query}[0];
if ($q0->{bind} and keys %bind > 0) {
    $q0->{bind_values} = arrayref($q0->{bind});
    $q0->{bind_values}[$_] = $bind{$q0->{bind}[$_]} foreach (0 .. @{$q0->{bind}} - 1);
}

# Fetch all data from the database recursively
my $out = fetch($q0);

sub fetch {
    # Process the query only if one was given
    my $query = shift; 
    return undef unless $query;

    $query->{bind_values} = [] unless ref $query->{bind_values} eq 'ARRAY';
    # Execute SQL query
    my $sth = $dbh->prepare($query->{sql});
    $sth->execute(@{$query->{bind_values}});
    my @columns = @{$sth->{NAME}}; 

    # Fetch all the current level's data and preserve columns order 
    my @return;
    for my $row (@{$sth->fetchall_arrayref()}) { 
        my %data;
        tie %data, 'Tie::IxHash';
        $data{$columns[$_]} = $row->[$_] for (0 .. $#columns);
        for my $subquery (@{ $query->{subquery} }) { 
            my @bind;
            push @bind, $data{$_} for (@{ arrayref($subquery->{bind}) });
            $subquery->{bind_values} = \@bind;
            my $sub = fetch($subquery);

            # Present output as a list 
            if ($subquery->{list}) {
                #if ( map ( $query->{list} eq $_ , keys $sub ) )
                my @list;
                for (@$sub) {
                    push @list, $_->{$subquery->{list}};
                }
                $sub = \@list;
            }

            if ($subquery->{key}) {
                $data{$subquery->{key}} = $sub;
            } else {
                die "[Error] Key is missing for query '$subquery->{sql}'";
            }
        }

        # Remove unwanted columns from the output
        if ($query->{hide}) {
            delete $data{$_} for( @{ arrayref($query->{hide}) } );
        }        

        push @return, \%data;
    }

    \@return;    
}

DumpYaml($out);

sub arrayref {
   my $ref = shift;
   return (ref $ref ne 'ARRAY') ? [$ref] : $ref; 
}

sub DumpYaml {
    # I am not happy with this current dumper. I cannot specify the indent and it does 
    # not preserve the extraction order
    print Dump shift;
}

__DATA__
---
connect: "dbi:SQLite:dbname=einstein-puzzle.sqlite"
indent: 4
query:
   - sql: SELECT * from people
     hide:
        - pet_id
        - house_id
        - id
     subquery:
        - key: brevage
          bind: id
          sql: |
                SELECT name, calories, potassium FROM drink
                LEFT JOIN people_has_drink ON drink.id = people_has_drink.id_drink
                WHERE people_has_drink.id_people = ?       
          hash:
             - name
        - key: house
          sql: SELECT color as paint, size, id from house WHERE id = ?
          hide: id
          bind: house_id
          subquery:
             - key: color
               sql: SELECT short, ral, hex from color WHERE short LIKE ?
               bind: paint
        - key: pet
          sql: SELECT name from pet WHERE id = ?
          bind: pet_id
          list: name  

And this is the output I get:

---
- brevage:
    - calories: 0
      name: water
      potassium: 3 mg
    - calories: 45
      name: orange juice
      potassium: 200 mg
  house:
    - color:
        - hex: '#F8F32B'
          ral: RAL 1018
          short: yellow
      paint: yellow
      size: small
  nationality: Norvegian
  pet:
    - cats
  smoke: Dunhill
- brevage:
    - calories: 42
      name: milk
      potassium: 150 mg
  house:
    - color:
        - hex: '#CB2821'
          ral: RAL 2002
          short: red
      paint: red
      size: big
  nationality: Brit
  pet:
    - birds
    - phasmatodea
  smoke: Pall Mall

Database

My test database is an SQLite database whose tables are listed below:

Table People
.----+-------------+----------+--------+-----------.
| id | nationality | house_id | pet_id | smoke     |
+----+-------------+----------+--------+-----------+
|  1 | Norvegian   |        4 |      3 | Dunhill   |
|  2 | Brit        |        1 |      2 | Pall Mall |
'----+-------------+----------+--------+-----------'
Table Drink
.----+--------------+----------+-----------.
| id | name         | calories | potassium |
+----+--------------+----------+-----------+
|  1 | tea          |        1 | 18 mg     |
|  2 | coffee       |        0 | 49 mg     |
|  3 | milk         |       42 | 150 mg    |
|  4 | beer         |       43 | 27 mg     |
|  5 | water        |        0 | 3 mg      |
|  6 | orange juice |       45 | 200 mg    |
'----+--------------+----------+-----------'
Table People Has Drink
.-----------+----------.
| id_people | id_drink |
+-----------+----------+
|         1 |        5 |
|         1 |        6 |
|         2 |        3 |
'-----------+----------'
Table House
+----+--------+--------+
| id | color  |  size  |
+----+--------+--------+
|  1 | red    | big    |
|  2 | green  | small  |
|  3 | white  | middle |
|  4 | yellow | small  |
|  5 | blue   | huge   |
+----+--------+--------+
Table Color
.--------+-------------+----------+---------.
| short  |    color    |   ral    |   hex   |
+--------+-------------+----------+---------+
| red    | Vermilion   | RAL 2002 | #CB2821 |
| green  | Pale green  | RAL 6021 | #89AC76 |
| white  | Light grey  | RAL 7035 | #D7D7D7 |
| yellow | Zinc yellow | RAL 1018 | #F8F32B |
| blue   | Capri blue  | RAL 5019 | #1B5583 |
'--------+-------------+----------+---------'
Table Pet
+----+-------------+
| id |    name     |
+----+-------------+
|  1 | dogs        |
|  2 | birds       |
|  3 | cats        |
|  4 | horses      |
|  5 | fishes      |
|  2 | phasmatodea |
+----+-------------+

Database data

If you wish to use the same data as mine, here is everything you need:

BEGIN TRANSACTION;
CREATE TABLE "pet" (
    `id`    INTEGER,
    `name`  TEXT
);
INSERT INTO `pet` VALUES (1,'dogs');
INSERT INTO `pet` VALUES (2,'birds');
INSERT INTO `pet` VALUES (3,'cats');
INSERT INTO `pet` VALUES (4,'horses');
INSERT INTO `pet` VALUES (5,'fishes');
INSERT INTO `pet` VALUES (2,'phasmatodea');
CREATE TABLE `people_has_drink` (
    `id_people` INTEGER NOT NULL,
    `id_drink`  INTEGER NOT NULL,
    PRIMARY KEY(id_people,id_drink)
);
INSERT INTO `people_has_drink` VALUES (1,5);
INSERT INTO `people_has_drink` VALUES (1,6);
INSERT INTO `people_has_drink` VALUES (2,3);
CREATE TABLE "people" (
    `id`    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `nationality`   VARCHAR(45),
    `house_id`  INT,
    `pet_id`    INT,
    `smoke` VARCHAR(45)
);
INSERT INTO `people` VALUES (1,'Norvegian',4,3,'Dunhill');
INSERT INTO `people` VALUES (2,'Brit',1,2,'Pall Mall');
CREATE TABLE "house" (
    `id`    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `color` TEXT,
    `size`  TEXT
);
INSERT INTO `house` VALUES (1,'red','big');
INSERT INTO `house` VALUES (2,'green','small');
INSERT INTO `house` VALUES (3,'white','middle');
INSERT INTO `house` VALUES (4,'yellow','small');
INSERT INTO `house` VALUES (5,'blue','huge');
CREATE TABLE `drink` (
    `id`    INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    `name`  TEXT,
    `calories`  INTEGER,
    `potassium` TEXT
);
INSERT INTO `drink` VALUES (1,'tea',1,'18 mg');
INSERT INTO `drink` VALUES (2,'coffee',0,'49 mg');
INSERT INTO `drink` VALUES (3,'milk',42,'150 mg');
INSERT INTO `drink` VALUES (4,'beer',43,'27 mg');
INSERT INTO `drink` VALUES (5,'water',0,'3 mg');
INSERT INTO `drink` VALUES (6,'orange juice',45,'200 mg');
CREATE TABLE `color` (
    `short` TEXT UNIQUE,
    `color` TEXT,
    `ral`   TEXT,
    `hex`   TEXT,
    PRIMARY KEY(short)
);
INSERT INTO `color` VALUES ('red','Vermilion','RAL 2002','#CB2821');
INSERT INTO `color` VALUES ('green','Pale green','RAL 6021','#89AC76');
INSERT INTO `color` VALUES ('white','Light grey','RAL 7035','#D7D7D7');
INSERT INTO `color` VALUES ('yellow','Zinc yellow','RAL 1018','#F8F32B');
INSERT INTO `color` VALUES ('blue','Capri blue','RAL 5019','#1B5583');
COMMIT;
nowox asked Jun 06 '15 14:06


1 Answer

"Is my implementation good?"

This is a rather broad question, and the answer probably depends on what you want from your code. For instance:

Does it work? Does it have all the features you need? Does it do what you want? Does it respond appropriately for all the ranges of inputs you want to cater for (and input you don't)? If you aren't sure, write some tests.
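As a starting point for such tests, here is a minimal sketch using the core Test::More module. It exercises only the arrayref helper copied from the question's script; the test descriptions and the choice of cases are assumptions made for illustration.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Copied from the question's script: wrap a non-array value in an array ref.
sub arrayref {
    my $ref = shift;
    return (ref $ref ne 'ARRAY') ? [$ref] : $ref;
}

is_deeply(arrayref('id'),       ['id'],     'a scalar becomes a one-element list');
is_deeply(arrayref(['a', 'b']), ['a', 'b'], 'an array ref passes through unchanged');

# Input you may not have planned for: a subquery without a bind key.
is_deeply(arrayref(undef),      [undef],    'undef is wrapped as well');
```

The third case shows the kind of thing tests tend to surface: a subquery with no bind key still yields one (undefined) bind value, which is probably not what a query without placeholders expects.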

Is it fast enough? If not, what are the slow bits? Use Devel::NYTProf to find them.

If it's working, you probably also want to turn your code into a module rather than just a script so you can use it again.
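A minimal sketch of that step, assuming a hypothetical package name My::TreeFetch and moving only the command-line parsing out of the script. In a real module this package would live in its own file (My/TreeFetch.pm) ending with a true value; it is kept in one file here so the example runs as is. Note that the pattern is anchored, which is slightly stricter than the script's original regex.

```perl
use strict;
use warnings;

{
    package My::TreeFetch;    # hypothetical module name

    # Same idea as the script's @ARGV loop, but reusable and testable:
    # collect --name=value arguments into a hash reference.
    sub parse_bind_args {
        my (@args) = @_;
        my %bind;
        for my $arg (@args) {
            next unless $arg =~ /\A--(\w+)=(.*)\z/s;
            $bind{$1} = $2;
        }
        return \%bind;
    }
}

package main;

my $bind = My::TreeFetch::parse_bind_args('--id=1', 'ignored', '--smoke=Pall Mall');
print "$_ => $bind->{$_}\n" for sort keys %$bind;
```

Once the logic sits in functions that take arguments and return values, instead of reading @ARGV and DATA directly, it can be called from tests and from other scripts.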

"and if not (I'm supposing that I am doing all wrong), what modules should I use to get the desired behavior?"

It sounds very much like you're trying to do something like DBIx::Class (aka DBIC) does when you ask it to prefetch; it will build you a data structure of objects.

If you need to do this dynamically in response to arbitrary databases and YAML, that's not quite what DBIC was designed to do; it's probably possible but will probably involve you dynamically creating packages, which will not be easy.

user52889 answered Nov 15 '22 11:11