I have a fairly stable directed graph of order ~100k vertices and size ~1k edges. It is two-dimensional insofar as its vertices can be identified by a pair of integers (x, y)
(of cardinality ~100 x ~1000) and all edges are strictly increasing in x
.
There is furthermore a dictionary of ~1k (key, val)
pairs associated with each vertex.
I am currently storing the graph in a MySQL database across three (InnoDB) tables: a table of vertices (which I don't think is relevant to my question, so I have omitted to include both it and the foreign key constraints that refer to it in my extracts below); a table which holds the dictionaries; and a 'closure table' of connected vertices as described so eloquently by Bill Karwin.
The table of vertex dictionaries is defined as follows:
CREATE TABLE `VertexDictionary` (
`x` smallint(6) unsigned NOT NULL,
`y` smallint(6) unsigned NOT NULL,
`key` varchar(50) NOT NULL DEFAULT '',
`val` smallint(1) DEFAULT NULL,
PRIMARY KEY (`x`, `y` , `key`),
KEY `dict` (`x`, `key`, `val`)
);
and the closure table of connected vertices as:
CREATE TABLE `ConnectedVertices` (
`tail_x` smallint(6) unsigned NOT NULL,
`tail_y` smallint(6) unsigned NOT NULL,
`head_x` smallint(6) unsigned NOT NULL,
`head_y` smallint(6) unsigned NOT NULL,
PRIMARY KEY (`tail_x`, `tail_y`, `head_x`),
KEY `reverse` (`head_x`, `head_y`, `tail_x`),
KEY `fx` (`tail_x`, `head_x`),
KEY `rx` (`head_x`, `tail_x`)
);
There is also a dictionary of (x, key)
pairs such that for each such pair, all vertices identified with that x
have within their dictionaries a value for that key
. This dictionary is stored in a fourth table:
CREATE TABLE `SpecialKeys` (
`x` smallint(6) unsigned NOT NULL,
`key` varchar(50) NOT NULL DEFAULT '',
PRIMARY KEY (`x`),
KEY `xkey` (`x`, `key`)
);
I often wish to extract the set of keys used in the dictionaries of all vertices having a particular x=X
, together with the associated value of any SpecialKeys
connected to the left:
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`ConnectedVertices` AS `c`
JOIN `VertexDictionary` AS `u` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
JOIN `VertexDictionary` AS `v` ON (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
WHERE
`v`.`x` = X
;
for which the EXPLAIN
output is:
id select_type table type possible_keys key key_len ref rows Extra 1 SIMPLE k index PRIMARY,xkey xkey 154 NULL 40 Using index; Using temporary 1 SIMPLE c ref PRIMARY,reverse,fx,rx PRIMARY 2 db.k.x 1 Using where 1 SIMPLE v ref PRIMARY,dict PRIMARY 4 const,db.c.head_y 136 Using index 1 SIMPLE u eq_ref PRIMARY,dict PRIMARY 156 db.c.tail_x,db.c.tail_y,db.k.key 1 Using where
But this query takes ~10s to complete. Been banging my head against a brick wall trying to improve matters, but to no avail.
Can the query be improved, or should I consider a different data structure? Extremely grateful for your thoughts!
UPDATE
I'm still getting nowhere with this, although I did rebuild the tables and found the EXPLAIN
output to be slightly different (as now shown above, the number of rows fetched from v
had increased from 1 to 136!); the query is still taking ~10s to execute.
I really don't understand what's going on here. Queries to obtain all (x, y, SpecialValue)
and all (x, y, key)
tuples are both very fast (~30ms and ~150ms respectively), yet essentially joining the two takes over fifty times longer than their combined time... how can I improve the time taken to perform that join?
Output of SHOW VARIABLES LIKE '%innodb%';
below:
Variable_name Value ------------------------------------------------------------ have_innodb YES ignore_builtin_innodb ON innodb_adaptive_flushing ON innodb_adaptive_hash_index ON innodb_additional_mem_pool_size 2097152 innodb_autoextend_increment 8 innodb_autoinc_lock_mode 1 innodb_buffer_pool_size 1179648000 innodb_change_buffering inserts innodb_checksums ON innodb_commit_concurrency 0 innodb_concurrency_tickets 500 innodb_data_file_path ibdata1:10M:autoextend innodb_data_home_dir /rdsdbdata/db/innodb innodb_doublewrite ON innodb_fast_shutdown 1 innodb_file_format Antelope innodb_file_format_check Barracuda innodb_file_per_table ON innodb_flush_log_at_trx_commit 1 innodb_flush_method O_DIRECT innodb_force_recovery 0 innodb_io_capacity 200 innodb_lock_wait_timeout 50 innodb_locks_unsafe_for_binlog OFF innodb_log_buffer_size 8388608 innodb_log_file_size 134217728 innodb_log_files_in_group 2 innodb_log_group_home_dir /rdsdbdata/log/innodb innodb_max_dirty_pages_pct 75 innodb_max_purge_lag 0 innodb_mirrored_log_groups 1 innodb_old_blocks_pct 37 innodb_old_blocks_time 0 innodb_open_files 300 innodb_read_ahead_threshold 56 innodb_read_io_threads 4 innodb_replication_delay 0 innodb_rollback_on_timeout OFF innodb_spin_wait_delay 6 innodb_stats_method nulls_equal innodb_stats_on_metadata ON innodb_stats_sample_pages 8 innodb_strict_mode OFF innodb_support_xa ON innodb_sync_spin_loops 30 innodb_table_locks ON innodb_thread_concurrency 0 innodb_thread_sleep_delay 10000 innodb_use_sys_malloc ON innodb_version 1.0.16 innodb_write_io_threads 4
Adjust the size and properties of the memory areas that MySQL uses for caching. With efficient use of the InnoDB buffer pool, MyISAM key cache, and the MySQL query cache, repeated queries run faster because the results are retrieved from memory the second and subsequent times.
MySQL 8+ with recursive cte (id, name, parent_id) as ( select id, name, parent_id from products where parent_id = 19 union all select p.id, p.name, p. parent_id from products p inner join cte on p.
Use hierarchyid as a data type to create tables with a hierarchical structure, or to describe the hierarchical structure of data that is stored in another location. Use the hierarchyid functions in Transact-SQL to query and manage hierarchical data.
Without spending time testing it, you provided an incomplete example? you should definitely try reordering of joined tables. Explain output provides some info, let's say ordering by key_len should be heuristically fastest. First table to be filtered on should be listed as last in case the optimizer is not able to figure that out, I believe.
So, let's say 'c, v, k, u' order is the best.
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`VertexDictionary` AS `u`
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
JOIN `VertexDictionary` AS `v`
JOIN `ConnectedVertices` AS `c` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
AND (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
`v`.`x` = X
;
'rows' would suggest 'c/u, k, v' order, but that depends on data:
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`VertexDictionary` AS `u`
JOIN `VertexDictionary` AS `v`
JOIN `SpecialKeys` AS `k` ON (`k`.`x`, `k`.`key`) = (`u`.`x`, `u`.`key`)
JOIN `ConnectedVertices` AS `c` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
AND (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
`v`.`x` = X
;
Hope this helps.
UPDATE (avoiding the varchar join):
SELECT DISTINCT
`v`.`key`,
`u`.`val`
FROM
`ConnectedVertices` AS `c`
JOIN `VertexDictionary` AS `u` ON (`u`.`x`, `u`.`y` ) = (`c`.`tail_x`, `c`.`tail_y`)
JOIN `VertexDictionary` AS `v` ON (`v`.`x`, `v`.`y` ) = (`c`.`head_x`, `c`.`head_y`)
WHERE
(`u`.`x`, `u`.`key`) IN (SELECT `k`.`x`, `k`.`key` FROM `SpecialKeys` AS `k`)
AND
`v`.`x` = X
;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With