
Convert sequence of integers 1, 2, 3, ... to corresponding sequence of strings A, B, C,

Tags:

r

What's a quick, scalable way to convert the integers 1 through N to a corresponding sequence of strings "A", "B", ... "Z", "AA", "AB", ... of the same length?

Alternatively, I'd be happy with something that maps the integer vector onto a character vector such that every element of the character vector has the same number of characters, e.g. 1, 2, ... 27 => "AA", "AB", ..., "AZ", "BA".

Example input:

num_vec <- seq(1, 1000)
char_vec <- ???

UPDATE

My hackish, but best working attempt:

library(data.table)
myfunc <- function(n){
  if(n <= 26){
    dt <- CJ(LETTERS)[, Result := paste0(V1)]
  } else if(n <= 26^2){
    dt <- CJ(LETTERS, LETTERS)[, Result := paste0(V1, V2)]
  } else if(n <= 26^3){
    dt <- CJ(LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3)]
  } else if(n <= 26^4){
    dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4)]
  } else if(n <= 26^5){
    dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4, V5)]
  } else if(n <= 26^6){
    dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4, V5, V6)]
  } else{
    stop("n too large")
  }

  return(dt$Result[1:n])
}

myfunc(10)
Ben asked Sep 27 '16 18:09


3 Answers

Several nice solutions were posted in the comments already. Only the solution posted by @Gregor currently gives the output Ben prefers.

However, the methods posted by @eddi, @DavidArenburg and @G.Grothendieck can be adapted to give the preferred outcome as well:

# adaptation of @eddi's method:
library(data.table)
n  <- 29
sz  <- ceiling(log(n)/log(26))
do.call(CJ, replicate(sz, c("", LETTERS), simplify = FALSE))[-1, unique(Reduce(paste0, .SD))][1:n]

# adaptation of @DavidArenburg's method:
n <- 29
list(LETTERS, c(LETTERS, do.call(CJ, replicate((n - 1) %/% 26 + 1, LETTERS, simplify = FALSE))[, do.call(paste0, .SD)][1:(n-26)])[[(n>26)+1]]

# adaptation of @G.Grothendieck's method:
n  <- 29
sz  <- ceiling(log(n)/log(26))
g <- expand.grid(c('',LETTERS), rep(LETTERS, (sz-1)))
g <- g[order(g$Var1),]
do.call(paste0, g)[1:n]

All three result in:

 [1] "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "J"  "K"  "L"  "M"  "N"  "O" 
[16] "P"  "Q"  "R"  "S"  "T"  "U"  "V"  "W"  "X"  "Y"  "Z"  "AA" "AB" "AC"
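
For comparison, the same "A", ..., "Z", "AA", "AB", ... sequence can be computed directly from each integer with bijective base-26 arithmetic, without building a cross join first. The following is only a minimal base-R sketch; `int_to_letters` is a hypothetical name, not something from the comments or answers here:

```r
# Hypothetical helper (not from the original answers):
# bijective base-26 conversion, 1 -> "A", 26 -> "Z", 27 -> "AA".
int_to_letters <- function(x) {
  x <- as.integer(x)
  stopifnot(all(x >= 1L))
  out <- character(length(x))
  active <- rep(TRUE, length(x))   # elements that still have digits left
  while (any(active)) {
    r <- (x[active] - 1L) %% 26L                      # current last "digit", 0..25
    out[active] <- paste0(LETTERS[r + 1L], out[active])
    x[active] <- (x[active] - 1L) %/% 26L             # drop that digit
    active[active] <- x[active] > 0L                  # finished once nothing remains
  }
  out
}

int_to_letters(c(1, 26, 27, 29))
# [1] "A"  "Z"  "AA" "AC"
```

The loop runs once per output character (at most a handful of times for any realistic n), and every step is vectorised over the whole input.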
4 revs, 2 users 96% answered Oct 18 '22 05:10


This seems like an awesome candidate for Rcpp. Below is a very simple approach:

// combVec.cpp
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
StringVector combVec(CharacterVector x, CharacterVector y) {
    int nx = x.size();
    int ny = y.size();
    CharacterVector z(nx*ny);
    int k = 0;
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            z[k] = x[i];
            z[k] += y[j];
            k++;
        }
    }
    return z;  
}

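NumChar <- function(n) {
    ch <- LETTERS
    # append one more letter until there are at least n strings
    # (the original t:1L loop counted down from 0 when n < 26 and ran twice)
    while (length(ch) < n) {ch <- combVec(ch, LETTERS)}
    ch[1:n]
}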

The result is exactly what the OP's answer returns.

library(data.table)
Rcpp::sourceCpp('combVec.cpp')

identical(myfunc(100000), NumChar(100000))
[1] TRUE 

head(NumChar(100000))
[1] "AAAA" "AAAB" "AAAC" "AAAD" "AAAE" "AAAF"
tail(NumChar(100000))
[1] "FRXY" "FRXZ" "FRYA" "FRYB" "FRYC" "FRYD"
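
The fixed-width output above can also be reproduced in plain R, which is handy for checking results when no compiler is available. This is only a sketch under the assumption that the i-th string is i - 1 written in base 26 with "A" as digit 0 and padded to the smallest sufficient width; `fixed_width_letters` is a hypothetical name, not part of this answer's code:

```r
# Hypothetical helper (not from the original answer):
# the i-th label is i - 1 in base 26, with "A" as digit 0, left-padded with "A".
fixed_width_letters <- function(n) {
  stopifnot(n >= 1)
  width <- 1L
  while (26^width < n) width <- width + 1L   # smallest width with 26^width >= n
  idx <- seq_len(n) - 1L
  out <- character(n)
  # build each label from the least significant digit outwards
  for (p in seq_len(width) - 1L) {
    out <- paste0(LETTERS[(idx %/% 26^p) %% 26 + 1], out)
  }
  out
}

fixed_width_letters(27)[c(1, 26, 27)]
# [1] "AA" "AZ" "BA"
```

The width rule mirrors the `n <= 26`, `n <= 26^2`, ... branches of the OP's `myfunc`.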

Updated benchmarks including @eddi's excellent Rcpp implementation:

library(microbenchmark)

microbenchmark(myfunc(10000), funEddi(10000), NumChar(10000), excelCols(10000, LETTERS))
Unit: microseconds
                     expr       min        lq       mean     median        uq       max neval  cld
            myfunc(10000)  6632.125  7255.454  8441.7770  7912.4780  9283.660 14184.971   100   c 
           funEddi(10000) 12012.673 12869.928 15296.3838 13870.7050 16425.907 80443.142   100    d
           NumChar(10000)  2592.555  2883.394  3326.9292  3167.4995  3574.300  6051.273   100  b  
excelCols(10000, LETTERS)   636.165   656.820   782.7679   716.9225   811.148  1386.673   100 a 

microbenchmark(myfunc(100000), funEddi(100000), NumChar(100000), excelCols(100000, LETTERS), times = 10)
Unit: milliseconds
                     expr        min         lq       mean    median        uq       max neval  cld
            myfunc(1e+05) 203.992591 210.049303 255.049395 220.74955 262.52141 397.03521    10   c 
           funEddi(1e+05) 523.934475 530.646483 563.853995 552.83903 577.88915 688.84714    10    d
           NumChar(1e+05)  82.216802  83.546577  97.615537  93.63809 112.14316 115.84911    10  b  
excelCols(1e+05, LETTERS)   7.480882   8.377266   9.562554   8.93254  11.10519  14.11631    10 a   

As @DirkEddelbuettel says, "Rcpp is not some magic pony...". These discrepancies in efficiency just show that although Rcpp, or any package for that matter, is super awesome, it won't fix crappy code. Thanks @eddi for posting a proper Rcpp implementation.

Joseph Wood answered Oct 18 '22 05:10


Here's a fast Rcpp solution which will be orders of magnitude faster than native R solutions:

library(Rcpp)
cppFunction('CharacterVector excelCols(int n, CharacterVector x) {
  CharacterVector res(n);
  int sz = x.size();
  std::string base;
  int baseN[100] = {0}; // being lazy about size here - you will never grow larger than this
  for (int i = 0; i < n; ++i) {
    bool incr = false;
    for (int j = base.size() - 1; j >= 0 && !incr; --j) {
      if (baseN[j] == sz) {
        baseN[j] = 1;
        base[j] = as<std::string>(x[0])[0];
      } else {
        baseN[j] += 1;
        base[j] = as<std::string>(x[baseN[j] - 1])[0];
        incr = true;
      }
    }
    if (!incr) {
      baseN[base.size()] = 1;
      base += x[0];
    }
    res[i] = base;
  }
  return res;
}')

excelCols(100, LETTERS)
eddi answered Oct 18 '22 05:10