Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting minimum number of swaps to group characters in string

I'm trying to solve a pretty complex problem with strings:

Given is a string with up to 100000 characters, consisting of only two different characters 'L' and 'R'. A sequence 'RL' is considered "bad", and such occurrences have to be reduced by applying swaps.

However, the string is to be considered circular, so even the string 'LLLRRR' has an 'RL' sequence formed by the last 'R' and the first 'L'.

Swaps of two consecutive elements can be made. So we can swap only elements that are on positions i and i+1, or on position 0 and n-1, if n is the length of the string (the string is 0-indexed).

The goal is to find the minimum number of swaps needed to leave only one bad connection in the string.

Example

For the string 'RLLRRL' the problem can be solved with exactly one swap: swap the first and the last characters (since the string is circular). The string will thus become 'LLLRRR' with one bad connection.

What I tried

My idea is to use dynamical programming, and to calculate for any given 'L' how many swaps are needed to put all other 'L's left to that 'L', and, alternatively, to the right of this 'L'. For any 'R' I calculate the same.

This algorithm works in O(N) time, but it doesn't give the correct result.

It doesn't work when I have to swap the first and the last elements. What should I add to my algorithm to make it work also for those swaps?

like image 250
someone12321 Avatar asked May 06 '17 12:05

someone12321


People also ask

How do you count number of swaps?

In ascending order: In Bubble sort, the largest element moves to the right. So swapping is done, when a smaller element is found on the right side. So to count the number of swaps for an element, just count the number of elements on the right side that are smaller than it.

Which sorting gives minimum number of swaps?

Selection Sort requires the minimum number of swaps. Based on Number of Comparisons This is the number of times the algorithm compares elements to sort the input.

What is the minimum necessary number of swaps of Neighbouring Letters to achieve this?

Minimum 4 swaps are required.


2 Answers

The problem can be solved in linear time.

Some observations and definitions:

  • The target of having only one bad connection is another way of saying that the L letters should all be grouped together, as well as the R letters (in a circular string)

  • Let a group denote a series of subsequent letters of the same kind that cannot be made larger (because of surrounding letters that differ). By combining individual swaps you can "move" a group with one or more "steps". An example -- I will write . instead of L so it is more easily readable:

    RRR...RR....
    

    There are 4 groups here: RRR, ..., RR and ..... Suppose you want to join the group of two "R" with the left-sided "R" group in the above string. Then you could "move" that middle group with 3 steps to the left by performing 6 swaps:

    RRR...RR....
    RRR..R.R....
    RRR..RR.....
    RRR.R.R.....
    RRR.RR......
    RRRR.R......
    RRRRR.......
    

    These 6 swaps are what constitutes one group move. The cost of the move is 6, and is the product of the size of the group (2) and the distance it travels (3). Note that this move is exactly the same as when we would have moved the group with three "L" characters (cf. the dots) to the right.

    I will use the word "move" in this meaning.

  • There is always a solution that can be expressed as a series of group moves, where each group move reduces the number of groups with two, i.e. with each such move, two R groups are merged into one, and consequently also two L groups are merged. In other words, there is always a solution where none of the groups has to split with one part of it moving to the left and another to the right. I will not give a proof of this claim here.

  • There is always a solution that has one group that will not move at all: all other groups of the same letter will move towards it. As a consequence there is also a group of the opposite letter that will not move, somewhere at the other end of the circle. Again, I will not prove this here.

  • The problem is then equivalent to minimizing the total cost (swaps) of the moves of the groups that represent one of the two letters (so half of all the groups). The other half of the groups move at the same time as was shown in the above example.

Algorithm

An algorithm could go like this:

Create an array of integers, where each value represents the size of a group. The array would list the groups in the order they appear. This would take the circular property into account, so that the first group (with index 0) would also account for the letter(s) at the very end of the string that are the same as the first letter(s). So at even indices you would have groups that represent counts of one particular letter, and at odd indices there would be counts of the other letter. It does not really matter which of the two letters they represent. The array of groups will always have an even number of entries. This array is all we need to solve the problem.

Pick the first group (index 0), and assume it will not move. Call it the "middle group". Determine which is the group of the opposite colour (with odd index) that will not have to move either. Call this other group the "split group". This split group will split the remaining odd groups into two sections, where the sum of their values (counts) is each less or equal the total of both sums. This represents the fact that it is cheaper for even groups to move in one direction than in the other in order to merge with the group at index 0.

Now determine the cost (number of swaps) for moving all even groups towards the middle group.

This may or may not be a solution, since the choice of the middle group was arbitrary.

The above would have to be repeated for the scenarios where any of the other even groups was taken as middle group.

Now the essence of the algorithm is to avoid redoing the whole operation when taking another group as the middle group. It turns out it is possible to take the next even group as middle group (at index 2) and adjust the previously calculated cost in constant time (on average) to arrive at the cost for this choice of the middle group. For this one has to keep a few parameters in memory: the cost for performing the moves in the left direction, and the cost for performing the moves in the right direction. Also the sum of even-group sizes needs to be maintained for each of both directions. And finally the sum of the odd-group sizes needs to be maintained as well for both directions. Each of these parameters can be adjusted when taking the next even group as middle group. Often the corresponding split group has to be re-identified as well, but also that can happen on average in constant time.

Without going to deep into this, here is a working implementation in simple JavaScript:

Code

function minimumSwaps(s) {
    var groups, start, n, i, minCost, halfSpace, splitAt, space,
        cost, costLeft, costRight, distLeft, distRight, itemsLeft, itemsRight;
    // 1. Get group sizes 
    groups = [];
    start = 0;
    for (i = 1; i < s.length; i++) {
        if (s[i] != s[start]) {
            groups.push(i - start);
            start = i;
        }
    }
    // ... exit when the number of groups is already optimal
    if (groups.length <= 2) return 0; // zero swaps
    // ... the number of groups should be even (because of circle)
    if (groups.length % 2 == 1) { // last character not same as first
        groups.push(s.length - start);
    } else { // Ends are connected: add to the length of first group
        groups[0] += s.length - start;
    }
    n = groups.length;
    // 2. Get the parameters of the scenario where group 0 is the middle:
    //    i.e. the members of group 0 do not move in that case.
    // Get sum of odd groups, which we consider as "space", while even 
    // groups are considered items to be moved.
    halfSpace = 0;
    for (i = 1; i < n; i+=2) {
        halfSpace += groups[i];
    }
    halfSpace /= 2;
    // Get split-point between what is "left" from the "middle" 
    // and what is "right" from it:
    space = 0;
    for (i = 1; space < halfSpace; i+=2) {
        space += groups[i];
    }
    splitAt = i-2;
    // Get sum of items, and cost, to the right of group 0
    itemsRight = distRight = costRight = 0;
    for (i = 2; i < splitAt; i+=2) {
        distRight += groups[i-1];
        itemsRight += groups[i];
        costRight += groups[i] * distRight;
    }
    // Get sum of items, and cost, to the left of group 0
    itemsLeft = distLeft = costLeft = 0;
    for (i = n-2; i > splitAt; i-=2) {
        distLeft += groups[i+1];
        itemsLeft += groups[i];
        costLeft += groups[i] * distLeft;
    }
    cost = costLeft + costRight;
    minCost = cost;
    // 3. Translate the cost parameters by incremental changes for 
    //    where the mid-point is set to the next even group
    for (i = 2; i < n; i += 2) {
        distLeft += groups[i-1];
        itemsLeft += groups[i-2];
        costLeft += itemsLeft * groups[i-1];
        costRight -= itemsRight * groups[i-1];
        itemsRight -= groups[i];
        distRight -= groups[i-1];
        // See if we need to change the split point. Items that get 
        // at the different side of the split point represent items
        // that have a shorter route via the other half of the circle.
        while (distLeft >= halfSpace) {
            costLeft -= groups[(splitAt+1)%n] * distLeft;
            distLeft -= groups[(splitAt+2)%n];
            itemsLeft -= groups[(splitAt+1)%n];
            itemsRight += groups[(splitAt+1)%n];
            distRight += groups[splitAt];
            costRight += groups[(splitAt+1)%n] * distRight;
            splitAt = (splitAt+2)%n;
        }
        cost = costLeft + costRight;
        if (cost < minCost) minCost = cost;
    }
    return minCost;
}

function validate(s) {
    return new Set(s).size <= 2; // maximum 2 different letters used
}

// I/O
inp.oninput = function () {
    var s, result, start;
    s = inp.value;
    start = performance.now(); // get timing
    if (validate(s)) {
        result = minimumSwaps(s); // apply algorithm
    } else {
        result = 'Please use only 2 different characters';
    }
    outp.textContent = result;
    ms.textContent = Math.round(performance.now() - start);
}

rnd.onclick = function () {
    inp.value = Array.from(Array(100000), _ => 
                    Math.random() < 0.5 ? "L" : "R").join('');
    if (inp.value.length != 100000) alert('Your browser truncated the input!');
    inp.oninput(); // trigger input handler
}

inp.oninput(); // trigger input handler
input { width: 100% }
<p>
    <b>Enter LR series:</b>
    <input id="inp" value="RLLRRL"><br>
    <button id="rnd">Produce random of size 100000</button>
</p><p>
    <b>Number of swaps: </b><span id="outp"></span><br>
    <b>Time used: </b><span id="ms"></span>ms
</p>

Time Complexity

The preprocessing (creating the groups array, etc..), and calculation of the cost for when the first group is the middle group, all consist of non-nested loops with at most n iterations, so this part is O(n).

The calculation of the cost for when the middle group is any of the other even groups consists of a loop (for the choosing the middle group), and another inner loop for adjusting the choice of the split group. Even though this inner loop may iterate multiple times for one iteration of the outer loop, in total this inner loop will not have more iterations than n, so the total execution time of this outer loop is still O(n).

Therefore the time complexity is O(n).

Note that the result for a string of 100 000 characters is calculated in a fraction of a second (see the number of milliseconds displayed by the above snippet).

like image 67
trincot Avatar answered Sep 17 '22 11:09

trincot


The task is to re-order the items in a circular list like this:

LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRRLRLLRLLRL  

so that we get a list like this:

RRRRRRLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRR  

or this:

LLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRRRLLLLL  

where the two types of items are grouped together, but the exact position of these two groups is not important.

The first task is to count the number of items in each group, so we iterate over the list once, and for the example above the result would be:

#L = 30  
#R = 20  

Then, the simple brute-force solution would be to consider every position in the list as the start of the L-zone, starting with position 0, iterate over the whole list and count how many steps each item is away from the border of the zone where it should be:

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRRR  <- desired output  
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRRLRLLRLLRL  <- input  
 <   <   <<  <   >> >   >  > >< <  <<<   > >> >> >  <- direction to move  

We would then consider the L-zone to start at position 1, and do the whole calculation again:

RLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRR  <- desired output  
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRRLRLLRLLRL  <- input  
<<   <   <<  <   >> >   >  > >  <  <<<   > >> >> >  <- direction to move  

After calculating the total number of steps for every position of the L-zone, we would know which position requires the least number of steps. This is of course a method with N2 complexity.

If we could calculate the number of required steps with the L-zone at position X, based on the calculation for the L-zone at position X-1 (without iterating over the whole list again), this could bring the complexity down to N.

To do so, we'd need to keep track of the number of wrong items in each half of each zone, and the total number of steps for the wrong items in each of these four half-zones:

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRRR  <- desired output  
<<<<<<<<<<<<<<<>>>>>>>>>>>>>>><<<<<<<<<<>>>>>>>>>>  <- half-zones
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRLRRLLRLLRL  <- input  
 <   <   <<  <   >> >   >  > >< <  <<<  >  >> >> >  <- direction to move  
        5              6           5         6      <- wrong items
       43             45          25        31      <- required steps  

When we move right to the next position, the total number of steps in left-moving zones will decrease by the number of wrong items in that zone, and the total number of steps in right-moving zones will increase by the number of wrong items in that zone (because every item is now one step closer/further from the edge of the zone.

        5              6           5         6      <- wrong items
       38             51          20        37      <- required steps  

However, we need to check the four border-points to see whether any wrong items have moved from one half-zone to another, and adjust the item and step count accordingly.

In the example, the L that was the first item of the L-zone has now become the last item in the R-zone, so we increment the R> half-zone's item and step count to 7 and 38.
Also, the L that was the first item in the R-zone, has become the last item of the L zone, so we decrement the item count for the R< half-zone to 4.
Also, the L in the middle of the R-zone has moved from the R> to the R< half-zone, so we decrement and increment the item counts for R> and R< to 6 and 5, and decrease and increase the step counts with 10 (the length of the R> and R< half-zones) to 28 and 30.

RLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLRRRRRRRRRRRRRRRRRRR  <- desired output  
><<<<<<<<<<<<<<<>>>>>>>>>>>>>>><<<<<<<<<<>>>>>>>>>  <- half-zones
LRLLLRLLLRRLLRLLLRRLRLLLRLLRLRLRLRRLLLRRLRRLLRLLRL  <- input  
><   <   <<  <   >> >   >  > >  <  <<<  <  >> >> >  <- direction to move  
         5              6           5         6     <- wrong items
        38             51          30        28     <- required steps  

So the total number of required steps when the L-zone starts at position 0 was 144, and we have calculated that when the L-zone starts at position 1, the total is now 147, by looking at what happens at four positions in the list, instead of having to iterate over the whole list again.


UPDATE

While thinking of how to implement this, I realised that the number of wrong items moving right in a zone must be the same as the number of wrong items moving left in the other zone; otherwise the border between the zones ends up in the wrong place. This means that the L and R zones aren't split into two half-zones of equal length, and the "mid" point in a zone moves according to how many wrong items are to the left and right of it. I still think it's possible to turn this into working code with O(N) efficiency, but it's probably not as straightforward as I first described it.

like image 33