
Efficient algorithm for merging small overlapping blocks into larger contiguous blocks?

I am faced with a rather interesting problem. I have a (fairly large) number of blocks. A block is simply something that starts at an offset and has a length and a color. The offset and length are limited: the space where these blocks exist is <0-N>, where N ranges from hundreds of thousands to a few million. A block is invalid if its offset is larger than N, or if the sum of its offset and length is larger than N. Each block has exactly one of about 16 different colors.

There might be thousands of blocks and there are always situations like this:

block_X: off: 100, len: 50, color: blue
block_Y: off: 148, len: 50, color: blue
block_Z: off: 200, len: 30, color: red

As you can see, the X and Y blocks can be connected into one single larger block, resulting in this:

block_XY: off: 100, len 98, color: blue
block_Z: off 200, len 30, color: red

I would like to create an algorithm that would go through all the blocks and connect those that overlap and have the same color. Actually, if the gap between the blocks is reasonably small (a constant I could choose, say about 16 or so, but it may be any number), then I would like to connect those blocks anyway. Note that in the above example, there are just two blocks that get connected. In reality, the sequence of blocks that can be connected is usually much longer.

There is also an interesting twist, consider this:

block_A: off: 100, len: 200, color: blue
block_B: off: 200, len: 200, color: blue
block_C: off: 300, len: 150, color: red
block_D: off: 400, len: 200, color: blue
block_E: off: 500, len: 200, color: blue

As you can see, we have a nice sequence of blue blocks that can be merged into one large blue block. However, there is also a small red block in the middle of it. This should not fool the algorithm, the correct result is:

block_ABDE: off: 100, len: 600, color: blue
block_C: off: 300, len: 150, color: red

The data are stored in a std::multimap where the key is the offset and the value is a structure containing the length and color of the block, something like: multimap<uint32_t,pair<uint32_t, uint8_t> >. Note that there might be blocks starting at the same offset. It is guaranteed, however, that if such a thing happens, the blocks starting at the same offset have different colors AND different lengths.

Can anyone come up with a nice idea how to solve this problem efficiently? The goal of the algorithm is to reduce the number of blocks we have, but it is NOT required to reduce it to the smallest number possible.

asked Dec 04 '22 by PeterK


2 Answers

I would suggest the following:

  1. Create a separate list for each of the colors
  2. For each specific color, calculate the beginning (offset) and the end (offset + length) of all blocks
  3. Sort each color-specific list by value of beginning
  4. Now, traverse each of the lists, merging the items: if the next item's beginning is less than or equal to the previous item's end (plus the allowed gap), remove the next item and extend the previous one.
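The four steps above can be sketched in C++17 roughly like this (a minimal sketch; the names `Block`, `mergeBlocks`, and the fixed 16-entry color array are my own, not from the question):

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// A block reduced to its interval: begin = offset, end = offset + length.
struct Block {
    uint32_t begin;
    uint32_t end;
};

// The questioner's representation: offset -> (length, color).
using BlockMap = std::multimap<uint32_t, std::pair<uint32_t, uint8_t>>;

// Merge same-colored blocks whose gap is at most maxGap.
// Returns (color, merged block) pairs.
std::vector<std::pair<uint8_t, Block>> mergeBlocks(const BlockMap& blocks,
                                                   uint32_t maxGap) {
    // Steps 1+2: one list per color, storing begin/end of each block.
    std::array<std::vector<Block>, 16> byColor;
    for (const auto& [off, lenColor] : blocks)
        byColor[lenColor.second].push_back({off, off + lenColor.first});

    std::vector<std::pair<uint8_t, Block>> result;
    for (uint8_t color = 0; color < 16; ++color) {
        auto& list = byColor[color];
        if (list.empty()) continue;
        // Step 3: sort each color-specific list by beginning offset.
        std::sort(list.begin(), list.end(),
                  [](const Block& a, const Block& b) { return a.begin < b.begin; });
        // Step 4: sweep, extending the current block while the next is in reach.
        Block cur = list[0];
        for (size_t i = 1; i < list.size(); ++i) {
            if (list[i].begin <= cur.end + maxGap)
                cur.end = std::max(cur.end, list[i].end);  // max handles nested blocks
            else {
                result.push_back({color, cur});
                cur = list[i];
            }
        }
        result.push_back({color, cur});
    }
    return result;
}
```

Running this on the "twist" example from the question (with blue = 0, red = 1 and a gap of 16) yields the two blocks block_ABDE (100-700) and block_C (300-450), as expected.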
answered Jan 08 '23 by Vlad


Separate the list of blocks into a list for each color, and sort all of those lists by the starting offset of the block. (Actually, you'll probably want to do an insertion sort as you filter by color, or sort by offset and then do a stable sort by color and work with partitions of your array.)

Once you have the per-color lists, you'll iterate over each one: Starting with the first block, check if the offset of the next block is close enough to the end of your current block, and if so, merge them. Then continue to the end of the list.

Now, you've combined all the blocks you can of each color, so you can just concatenate your lists to get the final list of all blocks of all colors. The most expensive step (asymptotically) will be the sorting, so this is probably efficient enough. You might be able to come up with something that is faster on average using more advanced data structures than arrays and linked lists, but it won't be worth the trouble until you've measured the performance of this simpler approach.

Note that because the rules for whether two blocks can be merged depend only on the end point of one block and the start point of the other, and do not depend on the sizes of the blocks, any solution that can identify any potential merges will probably identify all merges, and it won't matter what order the merges are done in (that is, merging is an associative operation).
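That observation can be captured in two tiny helpers (a sketch with hypothetical names; it assumes a block is stored as its begin/end pair, with begin = offset and end = offset + length):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical interval type: begin = offset, end = offset + length.
struct Interval {
    uint32_t begin;
    uint32_t end;
};

// Whether b can be merged onto a; assumes a.begin <= b.begin.
// Note the test reads only a.end and b.begin -- block sizes are irrelevant.
bool canMerge(const Interval& a, const Interval& b, uint32_t maxGap) {
    return b.begin <= a.end + maxGap;
}

// The merged block covers both inputs; max handles fully nested blocks.
Interval mergeIntervals(const Interval& a, const Interval& b) {
    return {std::min(a.begin, b.begin), std::max(a.end, b.end)};
}
```

For example, the question's block_X (100-150) and block_Y (148-198) satisfy `canMerge` even with a gap of 0, and `mergeIntervals` produces block_XY spanning 100-198.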

answered Jan 08 '23 by user57368