maintaing a sorted list that is bigger than memory

Question

I have a list of tuples.

[
  "Bob": 3,
  "Alice": 2,
  "Jane": 1,
]

When incrementing the counts

 "Alice" += 2

the order should be maintained:

[
  "Alice": 4,
  "Bob": 3,
  "Jane": 1,
]

When all is in memory there rather simple ways (some more or some less) to efficiently implement this. (using an index, insert-sort etc) The question though is: What's the most promising approach when the list does not fit into memory.

Bonus question: What if not even the index fits into memory?

How would you approach this?

mdma · Accepted Answer

B+ trees order a number of items using a key. In this case, the key is the count, and the item is the person's name. The entire B+tree doesn't need to fit into memory - just the current node being searched. You can set the maximum size of the nodes (and indirectly the depth of the tree) so that a node will fit into memory. (In practice nodes are usually far smaller than memory capacity.)

The data items are stored at the leaves of the tree, in so-called blocks. You can either store the items inline in the index, or store pointers to external storage. If the data is regularly sized, this can make for efficient retrieval from files. In the question example, the data items could be single names, but it would be more efficient to store blocks of names, all names in a block having the same count. The names within each block could also be sorted. (The names in the blocks themselves could be organized as a B-tree.)

If the number of names becomes large enough that the B+tree blocks are becoming ecessively large, the key can be made into a composite key, e.g. (count, first-letter). When searching the tree, only the count needs to be compared to find all names with that count. When inserting, or searcing for a specific name with a given count, then the full key can be compared to include filtering by name prefix.

Alternatively, instead of a composite key, the data items can point to offsets/blocks in an external file that contains the blocks of names, which will keep the B+tree itself small.

If the blocks of the btree are linked together, range queries can be efficiently implemented by searching for the start of the range, and then following block pointers to the next block until the end of the range is reached. This would allow you to efficiently implement "find all names with a count between 10 and 20".

As the other answers have noted, an RDBMS is the pre-packaged way of storing lists that don't fit into memory, but I hope this gives an insight into the structures used to solve the problem.

Daniel Trebbien · Answer

A relational database such as MySQL is specifically designed for storing large amounts of data the sum of which does not fit into memory, querying against this large amount of data, and even updating it in place.

For example:

CREATE TABLE `people` (
    `name`    VARCHAR(255),
    `count`   INT
);

INSERT INTO `people` VALUES
('Bob', 3),
('Alice', 2),
('Jane', 1);

UPDATE `people` SET `count` = `count` + 2;

After the UPDATE statement, the query SELECT * FROM people; will show:

+-------+-------+
| name  | count |
+-------+-------+
| Bob   |     5 |
| Alice |     4 |
| Jane  |     3 |
+-------+-------+

You can save the order of people in your table by adding an autoincrementing primary key:

CREATE TABLE `people` (
    `id`      INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `name`    VARCHAR(255),
    `count`   INT,

    PRIMARY KEY(`id`)
);

INSERT INTO `people` VALUES
(DEFAULT, 'Bob', 3),
(DEFAULT, 'Alice', 2),
(DEFAULT, 'Jane', 1);

maintaing a sorted list that is bigger than memory

Tags:

algorithm

sorting

memory

tcurdt

2 Answers

mdma

Daniel Trebbien

Recent Activity

Donate For Us

maintaing a sorted list that is bigger than memory

Tags:

algorithm

sorting

memory

tcurdt

2 Answers

mdma

Daniel Trebbien

Related questions

Recent Activity

Donate For Us