Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

All possible permutations of decimal numbers (hundredths) that sum up to 1 for a given length

Consider vector s as follows:

s=seq(0.01, 0.99, 0.01)

> s
 [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 
0.08 0.09 .......... 0.89 0.90 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99

Now given s and a fixed length m, I would like to have a matrix for all possible permutations of length m such that each row of matrix sums up to 1 (excluding the brute force approach).

For example, if m=4 (i.e. number of columns), the desired matrix would be something like this:

0.01 0.01 0.01 0.97
0.02 0.01 0.01 0.96
0.03 0.01 0.01 0.95
0.04 0.01 0.01 0.94
0.05 0.01 0.01 0.93
0.06 0.01 0.01 0.92
.
.
.
0.53 0.12 0.30 0.05
.
.
.
0.96 0.02 0.01 0.01
0.97 0.01 0.01 0.01
.
.
.
0.01 0.97 0.01 0.01
.
.
.
like image 624
989 Avatar asked Jun 07 '16 18:06

989


3 Answers

Here's how to do this using recursion:

permsum <- function(s,m) if (m==1L) matrix(s) else do.call(rbind,lapply(seq_len(s-m+1L),function(x) unname(cbind(x,permsum(s-x,m-1L)))));
res <- permsum(100L,4L);
head(res);
##      [,1] [,2] [,3] [,4]
## [1,]    1    1    1   97
## [2,]    1    1    2   96
## [3,]    1    1    3   95
## [4,]    1    1    4   94
## [5,]    1    1    5   93
## [6,]    1    1    6   92
tail(res);
##           [,1] [,2] [,3] [,4]
## [156844,]   95    2    2    1
## [156845,]   95    3    1    1
## [156846,]   96    1    1    2
## [156847,]   96    1    2    1
## [156848,]   96    2    1    1
## [156849,]   97    1    1    1

You can divide by 100 to get fractions, as opposed to integers:

head(res)/100;
##      [,1] [,2] [,3] [,4]
## [1,] 0.01 0.01 0.01 0.97
## [2,] 0.01 0.01 0.02 0.96
## [3,] 0.01 0.01 0.03 0.95
## [4,] 0.01 0.01 0.04 0.94
## [5,] 0.01 0.01 0.05 0.93
## [6,] 0.01 0.01 0.06 0.92

Explanation

First let's define the inputs:

  • s This is the target value to which each row in the output matrix should sum.
  • m This is number of columns to produce in the output matrix.

It is more efficient and reliable to compute the result using integer arithmetic, as opposed to floating-point arithmetic, so I designed my solution to work only with integers. Hence s is a scalar integer representing the target integer sum.


Now let's examine the sequence generated by seq_len() for the non-base case:

seq_len(s-m+1L)

This generates a sequence from 1 to the highest possible value that could be part of a sum to s with m columns remaining. For example, think about the case of s=100,m=4: the highest number we can use is 97, participating in a sum of 97+1+1+1. Each remaining column reduces the highest possible value by 1, which is why we must subtract m from s when computing the sequence length.

Each element of the generated sequence should be viewed as one possible "selection" of an addend in the summation.


do.call(rbind,lapply(seq_len(s-m+1L),function(x) ...))

For each of the selections, we must then recurse. We can use lapply() to do this.

To jump ahead, the lambda will make a single recursive call to permsum() and then cbind() the return value with the current selection. This will produce a matrix, always of the same width for this level of recursion. Hence, the lapply() call will return a list of matrices, all of the same width. We must then row-bind them together, which is why we must use the do.call(rbind,...) trick here.


unname(cbind(x,permsum(s-x,m-1L)))

The body of the lambda is fairly simple; we cbind() the current selection x with the return value of the recursive call, completing the summation for this submatrix. Unfortunately we must call unname(), otherwise each column that ends up being set from the x argument will have column name x.

The most important detail here is the choice of arguments to the recursive call. First, because the lambda argument x has just been selected out during the current recursive evaluation, we must subtract it from s to get a new summation target, which the impending recursive call will be responsible for attaining. Hence the first argument becomes s-x. Second, because the selection of x takes up one column, we must subtract 1 from m, so that the recursive call will have one fewer column to produce in its output matrix.


if (m==1L) matrix(s) else ...

Lastly, let's examine the base case. In every evaluation of the recursive function we must check if m has reached 1, in which case we can simply return the required sum s itself.


Floating-point discrepancy

I looked into the discrepancy between my results and psidom's results. For example:

library(data.table);

bgoldst <- function(s,m) permsum(s,m)/s;
psidom <- function(ss,m) { raw <- do.call(data.table::CJ,rep(list(ss),m)); raw[rowSums(raw)==1,]; };

## helper function to sort a matrix by columns
smp <- function(m) m[do.call(order,as.data.frame(m)),];

s <- 100L; m <- 3L; ss <- seq_len(s-1L)/s;
x <- smp(bgoldst(s,m));
y <- smp(unname(as.matrix(psidom(ss,m))));
nrow(x);
## [1] 4851
nrow(y);
## [1] 4809

So there's a 42 row discrepancy between our two results. I decided to try to find exactly which permutations were omitted with the following line of code. Basically, it compares each element of the two matrices and prints the comparison result as a logical matrix. We can scan down the scrollback to find the first differing row. Below is the excerpted output:

x==do.call(rbind,c(list(y),rep(list(NA),nrow(x)-nrow(y))));
##          [,1]  [,2]  [,3]
##    [1,]  TRUE  TRUE  TRUE
##    [2,]  TRUE  TRUE  TRUE
##    [3,]  TRUE  TRUE  TRUE
##    [4,]  TRUE  TRUE  TRUE
##    [5,]  TRUE  TRUE  TRUE
##
## ... snip ...
##
##   [24,]  TRUE  TRUE  TRUE
##   [25,]  TRUE  TRUE  TRUE
##   [26,]  TRUE  TRUE  TRUE
##   [27,]  TRUE  TRUE  TRUE
##   [28,]  TRUE  TRUE  TRUE
##   [29,]  TRUE FALSE FALSE
##   [30,]  TRUE FALSE FALSE
##   [31,]  TRUE FALSE FALSE
##   [32,]  TRUE FALSE FALSE
##   [33,]  TRUE FALSE FALSE
##
## ... snip ...

So it's at row 29 where we have the first discrepancy. Here's a window around that row in each permutation matrix:

win <- 27:31;
x[win,]; y[win,];
##      [,1] [,2] [,3]
## [1,] 0.01 0.27 0.72
## [2,] 0.01 0.28 0.71
## [3,] 0.01 0.29 0.70 (missing from y)
## [4,] 0.01 0.30 0.69 (missing from y)
## [5,] 0.01 0.31 0.68
##      [,1] [,2] [,3]
## [1,] 0.01 0.27 0.72
## [2,] 0.01 0.28 0.71
## [3,] 0.01 0.31 0.68
## [4,] 0.01 0.32 0.67
## [5,] 0.01 0.33 0.66

Interestingly, the missing permutations normally do sum to exactly 1 when you compute the sum manually. At first I thought it was data.table's CJ() function that was doing something strange with floats, but further testing seems to indicate it's something rowSums() is doing:

0.01+0.29+0.70==1;
## [1] TRUE
ss[1L]+ss[29L]+ss[70L]==1;
## [1] TRUE
rowSums(CJ(0.01,0.29,0.70))==1; ## looks like CJ()'s fault, but wait...
## [1] FALSE
cj <- CJ(0.01,0.29,0.70);
cj$V1+cj$V2+cj$V3==1; ## not CJ()'s fault
## [1] TRUE
rowSums(matrix(c(0.01,0.29,0.70),1L,byrow=T))==1; ## rowSums()'s fault
## [1] FALSE

We can work around this rowSums() quirk by applying a manual (and somewhat arbitrary) tolerance in the floating-point comparison. To do this we need to take the absolute difference and then perform a less-than comparison against the tolerance:

abs(rowSums(CJ(0.01,0.29,0.70))-1)<1e-10;
## [1] TRUE

Hence:

psidom2 <- function(ss,m) { raw <- do.call(data.table::CJ,rep(list(ss),m)); raw[abs(rowSums(raw)-1)<1e-10,]; };
y <- smp(unname(as.matrix(psidom2(ss,m))));
nrow(y);
## [1] 4851
identical(x,y);
## [1] TRUE

Combinations

Thanks to Joseph Wood for pointing out that this is really permutations. I originally named my function combsum(), but I renamed it to permsum() to reflect this revelation. And, as Joseph suggested, it is possible to modify the algorithm to produce combinations, which can be done as follows, now living up to the name combsum():

combsum <- function(s,m,l=s) if (m==1L) matrix(s) else do.call(rbind,lapply(seq((s+m-1L)%/%m,min(l,s-m+1L)),function(x) unname(cbind(x,combsum(s-x,m-1L,x)))));
res <- combsum(100L,4L);
head(res);
##      [,1] [,2] [,3] [,4]
## [1,]   25   25   25   25
## [2,]   26   25   25   24
## [3,]   26   26   24   24
## [4,]   26   26   25   23
## [5,]   26   26   26   22
## [6,]   27   25   24   24
tail(res);
##         [,1] [,2] [,3] [,4]
## [7148,]   94    3    2    1
## [7149,]   94    4    1    1
## [7150,]   95    2    2    1
## [7151,]   95    3    1    1
## [7152,]   96    2    1    1
## [7153,]   97    1    1    1

This required 3 changes.

First, I added a new parameter l, which stands for "limit". Basically, in order to guarantee that each recursion generates a unique combination, I enforce that each selection must be less than or equal to any previous selection in the current combination. This required taking the current upper limit as a parameter l. On the top-level call l can just be defaulted to s, which is actually too high anyway for cases where m>1, but that's not a problem, since it's just one of two upper limits that will be applied during sequence generation.

The second change was of course to pass the latest selection x as the argument to l when making the recursive call in the lapply() lambda.

The final change is the trickiest. The selection sequence must now be computed as follows:

seq((s+m-1L)%/%m,min(l,s-m+1L))

The lower limit had to be raised from the 1 used in permsum() to the lowest possible selection that would still allow a descending-magnitude combination. The lowest possible selection of course depends on how many columns have yet to be produced; the more columns, the more "room" we have to leave for future selections. The formula is to take an integer division of s on m, but we also must effectively "round up", which is why I add m-1L prior to taking the division. I also considered doing a floating-point division and then calling as.integer(ceiling(...)), but I think the all-integer approach is much better.

For example, consider the case of s=10,m=3. To produce a sum of 10 with 3 columns remaining, we cannot make a selection less than 4, because then we would not have enough quantity to produce 10 without ascending along the combination. In this case, the formula divides 12 by 3 to give 4.

The upper limit can be computed from the same formula used in permsum(), except that we must also apply the current limit l using a call to min().


I've verified that my new combsum() behaves identically to Joseph's IntegerPartitionsOfLength() function for many random test cases with the following code:

## helper function to sort a matrix within each row and then by columns
smc <- function(m) smp(t(apply(m,1L,sort)));

## test loop
for (i in seq_len(1000L)) {
    repeat {
        s <- sample(1:100,1L);
        m <- sample(2:5,1L);
        if (s>=m) break;
    };
    x <- combsum(s,m);
    y <- IntegerPartitionsOfLength(s,m);
    cat(paste0(s,',',m,'\n'));
    if (!identical(smc(x),smc(y))) stop('bad.');
};

Benchmarking

Common self-contained test code:

library(microbenchmark);
library(data.table);
library(partitions);
library(gtools);

permsum <- function(s,m) if (m==1L) matrix(s) else do.call(rbind,lapply(seq_len(s-m+1L),function(x) unname(cbind(x,permsum(s-x,m-1L)))));
combsum <- function(s,m,l=s) if (m==1L) matrix(s) else do.call(rbind,lapply(seq((s+m-1L)%/%m,min(l,s-m+1L)),function(x) unname(cbind(x,combsum(s-x,m-1L,x)))));
IntegerPartitionsOfLength <- function(n, Lim, combsOnly = TRUE) { a <- 0L:n; k <- 2L; a[2L] <- n; MyParts <- vector("list", length=P(n)); count <- 0L; while (!(k==1L) && k <= Lim + 1L) { x <- a[k-1L]+1L; y <- a[k]-1L; k <- k-1L; while (x<=y && k <= Lim) {a[k] <- x; y <- y-x; k <- k+1L}; a[k] <- x+y; if (k==Lim) { count <- count+1L; MyParts[[count]] <- a[1L:k]; }; }; MyParts <- MyParts[1:count]; if (combsOnly) {do.call(rbind, MyParts)} else {MyParts}; };
GetDecimalReps <- function(s,m) { myPerms <- permutations(m,m); lim <- nrow(myPerms); intParts <- IntegerPartitionsOfLength(s,m,FALSE); do.call(rbind, lapply(intParts, function(x) { unique(t(sapply(1L:lim, function(y) x[myPerms[y, ]]))); })); };

smp <- function(m) m[do.call(order,as.data.frame(m)),];
smc <- function(m) smp(t(apply(m,1L,sort)));

bgoldst.perm <- function(s,m) permsum(s,m)/s;
psidom2 <- function(ss,m) { raw <- do.call(data.table::CJ,rep(list(ss),m)); raw[abs(rowSums(raw)-1)<1e-10,]; };
joseph.perm <- function(s,m) GetDecimalReps(s,m)/s;

bgoldst.comb <- function(s,m) combsum(s,m)/s;
joseph.comb <- function(s,m) IntegerPartitionsOfLength(s,m)/s;

Permutations

## small scale
s <- 10L; m <- 3L; ss <- seq_len(s-1L)/s;
ex <- smp(bgoldst.perm(s,m));
identical(ex,smp(unname(as.matrix(psidom2(ss,m)))));
## [1] TRUE
identical(ex,smp(joseph.perm(s,m)));
## [1] TRUE

microbenchmark(bgoldst.perm(s,m),psidom2(ss,m),joseph.perm(s,m));
## Unit: microseconds
##                expr      min        lq      mean   median        uq      max neval
##  bgoldst.perm(s, m)  347.254  389.5920  469.1011  420.383  478.7575 1869.697   100
##      psidom2(ss, m)  702.206  830.5015 1007.5111  907.265 1038.3405 2618.089   100
##   joseph.perm(s, m) 1225.225 1392.8640 1722.0070 1506.833 1860.0745 4411.234   100

## large scale
s <- 100L; m <- 4L; ss <- seq_len(s-1L)/s;
ex <- smp(bgoldst.perm(s,m));
identical(ex,smp(unname(as.matrix(psidom2(ss,m)))));
## [1] TRUE
identical(ex,smp(joseph.perm(s,m)));
## [1] TRUE

microbenchmark(bgoldst.perm(s,m),psidom2(ss,m),joseph.perm(s,m),times=5L);
## Unit: seconds
##                expr      min        lq      mean    median        uq       max neval
##  bgoldst.perm(s, m) 1.286856  1.304177  1.426376  1.374411  1.399850  1.766585     5
##      psidom2(ss, m) 6.673545  7.046951  7.416161  7.115375  7.629177  8.615757     5
##   joseph.perm(s, m) 5.299452 10.499891 13.769363 12.680607 15.107748 25.259117     5

## very large scale
s <- 100L; m <- 5L; ss <- seq_len(s-1L)/s;
ex <- smp(bgoldst.perm(s,m));
identical(ex,smp(unname(as.matrix(psidom2(ss,m)))));
## Error: cannot allocate vector of size 70.9 Gb
identical(ex,smp(joseph.perm(s,m)));
## [1] TRUE

microbenchmark(bgoldst.perm(s,m),joseph.perm(s,m),times=1L);
## Unit: seconds
##                expr      min       lq     mean   median       uq      max neval
##  bgoldst.perm(s, m) 28.58359 28.58359 28.58359 28.58359 28.58359 28.58359     1
##   joseph.perm(s, m) 50.51965 50.51965 50.51965 50.51965 50.51965 50.51965     1

Combinations

## small-scale
s <- 10L; m <- 3L;
ex <- smc(bgoldst.comb(s,m));
identical(ex,smc(joseph.comb(s,m)));
## [1] TRUE

microbenchmark(bgoldst.comb(s,m),joseph.comb(s,m));
## Unit: microseconds
##                expr     min       lq     mean   median       uq      max neval
##  bgoldst.comb(s, m) 161.225 179.6145 205.0898 187.3120 199.5005 1310.328   100
##   joseph.comb(s, m) 172.344 191.8025 204.5681 197.7895 205.2735  437.489   100

## large-scale
s <- 100L; m <- 4L;
ex <- smc(bgoldst.comb(s,m));
identical(ex,smc(joseph.comb(s,m)));
## [1] TRUE

microbenchmark(bgoldst.comb(s,m),joseph.comb(s,m),times=5L);
## Unit: milliseconds
##                expr       min        lq      mean    median       uq       max neval
##  bgoldst.comb(s, m)  409.0708  485.9739  556.4792  591.4774  627.419  668.4548     5
##   joseph.comb(s, m) 2164.2134 3315.0138 3317.9725 3540.6240 3713.732 3856.2793     5

## very large scale
s <- 100L; m <- 6L;
ex <- smc(bgoldst.comb(s,m));
identical(ex,smc(joseph.comb(s,m)));
## [1] TRUE

microbenchmark(bgoldst.comb(s,m),joseph.comb(s,m),times=1L);
## Unit: seconds
##                expr       min        lq      mean    median        uq       max neval
##  bgoldst.comb(s, m)  2.498588  2.498588  2.498588  2.498588  2.498588  2.498588     1
##   joseph.comb(s, m) 12.344261 12.344261 12.344261 12.344261 12.344261 12.344261     1
like image 142
bgoldst Avatar answered Oct 16 '22 20:10

bgoldst


Take m=4 for example, a memory intensive approach would be:

raw <- data.table::CJ(s,s,s,s)
result <- raw[rowSums(raw) == 1, ]    
head(result)

     V1   V2   V3   V4
1: 0.01 0.01 0.01 0.97
2: 0.01 0.01 0.02 0.96
3: 0.01 0.01 0.03 0.95
4: 0.01 0.01 0.04 0.94
5: 0.01 0.01 0.05 0.93
6: 0.01 0.01 0.06 0.92
like image 29
Psidom Avatar answered Oct 16 '22 21:10

Psidom


Here is an algorithm that will return pure combinations (order doesn't matter). It is based off of an integer partition algorithm built by Jerome Kelleher (link).

library(partitions)
IntegerPartitionsOfLength <- function(n, Lim, combsOnly = TRUE) {
    a <- 0L:n
    k <- 2L
    a[2L] <- n
    MyParts <- vector("list", length=P(n))
    count <- 0L
    while (!(k==1L) && k <= Lim + 1L) {
        x <- a[k-1L]+1L
        y <- a[k]-1L
        k <- k-1L
        while (x<=y && k <= Lim) {a[k] <- x; y <- y-x; k <- k+1L}
        a[k] <- x+y
        if (k==Lim) {
            count <- count+1L
            MyParts[[count]] <- a[1L:k]
        }
    }
    MyParts <- MyParts[1:count]
    if (combsOnly) {do.call(rbind, MyParts)} else {MyParts}
}

system.time(res <- combsum(100L,5L))
 user  system elapsed 
 0.75    0.00    0.77

system.time(a <- IntegerPartitionsOfLength(100, 5))
 user  system elapsed 
 1.36    0.37    1.76

identical(smc(a),smc(res))
[1] TRUE

head(a)
[,1] [,2] [,3] [,4] [,5]
[1,]    1    1    1    1   96
[2,]    1    1    1    2   95
[3,]    1    1    1    3   94
[4,]    1    1    1    4   93
[5,]    1    1    1    5   92
[6,]    1    1    1    6   91

A really large example (N.B. using the smc function created by @bgoldst):

system.time(a <- IntegerPartitionsOfLength(100L,6L))
 user  system elapsed 
 4.57    0.36    4.93

system.time(res <- combsum(100L,6L))
 user  system elapsed 
 3.69    0.00    3.71

identical(smc(a),smc(res))
[1] TRUE

## this would take a very long time with GetDecimalReps below

Note: IntegerPartitionsOfLength only returns the combinations of a particular set of numbers and not the permutations of a set of numbers (order matters). E.g. for the set s = (1, 1, 3), the combinations of s is precisely s, while the permutations of s are: (1, 1, 3), (1, 3, 1), (3, 1, 1).

If you want an answer like the OP is asking for, you will have to do something like this (this is by no means the best way and isn't nearly as efficient as @bgoldst's permsum above):

library(gtools)
GetDecimalReps <- function(n) {
    myPerms <- permutations(n,n); lim <- nrow(myPerms)
    intParts <- IntegerPartitionsOfLength(100,n,FALSE)
    do.call(rbind, lapply(intParts, function(x) {
        unique(t(sapply(1L:lim, function(y) x[myPerms[y, ]])))
    }))
}

system.time(a <- GetDecimalReps(4L))
 user  system elapsed 
 2.85    0.42    3.28

system.time(res <- combsum(100L,4L))
 user  system elapsed 
 1.35    0.00    1.34

head(a/100)
[,1] [,2] [,3] [,4]
[1,] 0.01 0.01 0.01 0.97
[2,] 0.01 0.01 0.97 0.01
[3,] 0.01 0.97 0.01 0.01
[4,] 0.97 0.01 0.01 0.01
[5,] 0.01 0.01 0.02 0.96
[6,] 0.01 0.01 0.96 0.02

tail(a/100)
[,1] [,2] [,3] [,4]
[156844,] 0.25 0.26 0.24 0.25
[156845,] 0.25 0.26 0.25 0.24
[156846,] 0.26 0.24 0.25 0.25
[156847,] 0.26 0.25 0.24 0.25
[156848,] 0.26 0.25 0.25 0.24
[156849,] 0.25 0.25 0.25 0.25

identical(smp(a),smp(res))  ## using the smp function created by @bgoldst
[1] TRUE

@bgoldst algorithms above are superior for both return types (i.e. combinations/permutations). Also see @bgoldst's excellent benchmarks above. As a closing remark, you can easily modify IntegerPartionsOfLength to obtain all combinations of 1:100 that sum to 100 for k <= m by simply changing the k==Lim to k <= Lim and also setting combsOnly = FALSE in order to return a list. Cheers!

like image 3
Joseph Wood Avatar answered Oct 16 '22 20:10

Joseph Wood