Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

maintaing a sorted list that is bigger than memory

I have a list of tuples.

[
  "Bob": 3,
  "Alice": 2,
  "Jane": 1,
]

When incrementing the counts

 "Alice" += 2

the order should be maintained:

[
  "Alice": 4,
  "Bob": 3,
  "Jane": 1,
]

When all is in memory there rather simple ways (some more or some less) to efficiently implement this. (using an index, insert-sort etc) The question though is: What's the most promising approach when the list does not fit into memory.

Bonus question: What if not even the index fits into memory?

How would you approach this?

like image 632
tcurdt Avatar asked Dec 02 '22 05:12

tcurdt


2 Answers

B+ trees order a number of items using a key. In this case, the key is the count, and the item is the person's name. The entire B+tree doesn't need to fit into memory - just the current node being searched. You can set the maximum size of the nodes (and indirectly the depth of the tree) so that a node will fit into memory. (In practice nodes are usually far smaller than memory capacity.)

The data items are stored at the leaves of the tree, in so-called blocks. You can either store the items inline in the index, or store pointers to external storage. If the data is regularly sized, this can make for efficient retrieval from files. In the question example, the data items could be single names, but it would be more efficient to store blocks of names, all names in a block having the same count. The names within each block could also be sorted. (The names in the blocks themselves could be organized as a B-tree.)

If the number of names becomes large enough that the B+tree blocks are becoming ecessively large, the key can be made into a composite key, e.g. (count, first-letter). When searching the tree, only the count needs to be compared to find all names with that count. When inserting, or searcing for a specific name with a given count, then the full key can be compared to include filtering by name prefix.

Alternatively, instead of a composite key, the data items can point to offsets/blocks in an external file that contains the blocks of names, which will keep the B+tree itself small.

If the blocks of the btree are linked together, range queries can be efficiently implemented by searching for the start of the range, and then following block pointers to the next block until the end of the range is reached. This would allow you to efficiently implement "find all names with a count between 10 and 20".

As the other answers have noted, an RDBMS is the pre-packaged way of storing lists that don't fit into memory, but I hope this gives an insight into the structures used to solve the problem.

like image 62
mdma Avatar answered Dec 28 '22 04:12

mdma


A relational database such as MySQL is specifically designed for storing large amounts of data the sum of which does not fit into memory, querying against this large amount of data, and even updating it in place.

For example:

CREATE TABLE `people` (
    `name`    VARCHAR(255),
    `count`   INT
);

INSERT INTO `people` VALUES
('Bob', 3),
('Alice', 2),
('Jane', 1);

UPDATE `people` SET `count` = `count` + 2;

After the UPDATE statement, the query SELECT * FROM people; will show:

+-------+-------+
| name  | count |
+-------+-------+
| Bob   |     5 |
| Alice |     4 |
| Jane  |     3 |
+-------+-------+

You can save the order of people in your table by adding an autoincrementing primary key:

CREATE TABLE `people` (
    `id`      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name`    VARCHAR(255),
    `count`   INT,

    PRIMARY KEY(`id`)
);

INSERT INTO `people` VALUES
(DEFAULT, 'Bob', 3),
(DEFAULT, 'Alice', 2),
(DEFAULT, 'Jane', 1);
like image 39
Daniel Trebbien Avatar answered Dec 28 '22 02:12

Daniel Trebbien