What's a quick, scalable way to convert the integers 1 through N to a corresponding sequence of strings "A", "B", ... "Z", "AA", "AB", ... of the same length?
Alternatively, I'd be happy with something that maps the integer vector onto a character vector such that each element of the character vector has the same number of characters, e.g. 1, 2, ..., 27 => "AA", "AB", ..., "AZ", "BA".
Example input:
num_vec <- seq(1, 1000)
char_vec <- ???
UPDATE
My hackish, but best working attempt:
library(data.table)
myfunc <- function(n){
    if(n <= 26){
        dt <- CJ(LETTERS)[, Result := paste0(V1)]
    } else if(n <= 26^2){
        dt <- CJ(LETTERS, LETTERS)[, Result := paste0(V1, V2)]
    } else if(n <= 26^3){
        dt <- CJ(LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3)]
    } else if(n <= 26^4){
        dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4)]
    } else if(n <= 26^5){
        dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4, V5)]
    } else if(n <= 26^6){
        dt <- CJ(LETTERS, LETTERS, LETTERS, LETTERS, LETTERS, LETTERS)[, Result := paste0(V1, V2, V3, V4, V5, V6)]
    } else{
        stop("n too large")
    }
    return(dt$Result[1:n])
}
myfunc(10)
Several nice solutions were already posted in the comments. So far, only the solution posted by @Gregor gives the output Ben prefers.
However, the methods posted by @eddi, @DavidArenburg and @G.Grothendieck can be adapted to get the preferred outcome as well:
# adaptation of @eddi's method:
library(data.table)
n <- 29
sz <- ceiling(log(n)/log(26))
do.call(CJ, replicate(sz, c("", LETTERS), simplify = FALSE))[-1, unique(Reduce(paste0, .SD))][1:n]
# adaptation of @DavidArenburg's method:
n <- 29
list(LETTERS, c(LETTERS, do.call(CJ, replicate((n - 1) %/% 26 + 1, LETTERS, simplify = FALSE))[, do.call(paste0, .SD)][1:(n-26)])[[(n>26)+1]]
# adaptation of @G.Grothendieck's method:
n <- 29
sz <- ceiling(log(n)/log(26))
g <- expand.grid(c('',LETTERS), rep(LETTERS, (sz-1)))
g <- g[order(g$Var1),]
do.call(paste0, g)[1:n]
All three result in:
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z" "AA" "AB" "AC"
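For reference, the same sequence can also be computed arithmetically, without building a cross join: the labels are the integers written in bijective base 26 (digits 1 to 26, no zero). A minimal base-R sketch, assuming positive whole-number input; `int_to_letters` is a hypothetical name, not something from the comments above:

```r
# Hypothetical helper: map positive integers to "A", ..., "Z", "AA", "AB", ...
int_to_letters <- function(x) {
  x <- as.numeric(x)
  stopifnot(all(x >= 1))
  out <- character(length(x))
  active <- rep(TRUE, length(x))           # elements that still have letters to emit
  while (any(active)) {
    r <- (x[active] - 1) %% 26             # 0-based index of the last letter
    out[active] <- paste0(LETTERS[r + 1], out[active])
    x[active] <- (x[active] - 1) %/% 26    # drop the letter just emitted
    active[active] <- x[active] > 0
  }
  out
}

int_to_letters(c(1, 26, 27, 29, 702, 703))
# "A" "Z" "AA" "AC" "ZZ" "AAA"
```

Because it works on the whole vector at once, it never materialises more combinations than requested, and `int_to_letters(seq_len(n))` should give the same result as the three adaptations above.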
This seems like an awesome candidate for Rcpp. Below is a very simple approach:
#include <Rcpp.h>
using namespace Rcpp;

// Concatenate every element of x with every element of y (x varies slowest)
// [[Rcpp::export]]
StringVector combVec(CharacterVector x, CharacterVector y) {
    int nx = x.size();
    int ny = y.size();
    CharacterVector z(nx*ny);
    int k = 0;
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < ny; j++) {
            z[k] = x[i];
            z[k] += y[j];
            k++;
        }
    }
    return z;
}
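NumChar <- function(n) {
    # number of letters needed: smallest k with 26^k >= n
    k <- 1L
    while (26^k < n) k <- k + 1L
    ch <- LETTERS
    # seq_len() yields an empty loop when k is 1 (n <= 26)
    for (i in seq_len(k - 1L)) {ch <- combVec(ch, LETTERS)}
    ch[1:n]
}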
The result is exactly what the OP's answer returns.
library(data.table)
Rcpp::sourceCpp('combVec.cpp')
identical(myfunc(100000), NumChar(100000))
[1] TRUE
head(NumChar(100000))
[1] "AAAA" "AAAB" "AAAC" "AAAD" "AAAE" "AAAF"
tail(NumChar(100000))
[1] "FRXY" "FRXZ" "FRYA" "FRYB" "FRYC" "FRYD"
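The equal-width labels shown above are just the zero-based index written in ordinary base 26 and left-padded with "A". A rough base-R sketch of that idea, assuming n >= 1; `fixed_width_letters` is a hypothetical helper, not part of the Rcpp code in this answer:

```r
# Hypothetical helper: labels for 1..n, all with the same number of letters
fixed_width_letters <- function(n) {
  stopifnot(n >= 1)
  width <- 1L
  while (26^width < n) width <- width + 1L      # smallest width with 26^width >= n
  idx <- seq_len(n) - 1                         # 0-based positions
  out <- character(n)
  for (p in seq_len(width)) {
    out <- paste0(LETTERS[idx %% 26 + 1], out)  # prepend the next base-26 digit
    idx <- idx %/% 26
  }
  out
}

fixed_width_letters(27)[c(1, 26, 27)]
# "AA" "AZ" "BA"
```

For n = 100000 this produces four-letter labels running from "AAAA" to "FRYD", in line with the head and tail printed above.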
Updated benchmarks including @eddi's excellent Rcpp implementation:
library(microbenchmark)
microbenchmark(myfunc(10000), funEddi(10000), NumChar(10000), excelCols(10000, LETTERS))
Unit: microseconds
expr min lq mean median uq max neval cld
myfunc(10000) 6632.125 7255.454 8441.7770 7912.4780 9283.660 14184.971 100 c
funEddi(10000) 12012.673 12869.928 15296.3838 13870.7050 16425.907 80443.142 100 d
NumChar(10000) 2592.555 2883.394 3326.9292 3167.4995 3574.300 6051.273 100 b
excelCols(10000, LETTERS) 636.165 656.820 782.7679 716.9225 811.148 1386.673 100 a
microbenchmark(myfunc(100000), funEddi(100000), NumChar(100000), excelCols(100000, LETTERS), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
myfunc(1e+05) 203.992591 210.049303 255.049395 220.74955 262.52141 397.03521 10 c
funEddi(1e+05) 523.934475 530.646483 563.853995 552.83903 577.88915 688.84714 10 d
NumChar(1e+05) 82.216802 83.546577 97.615537 93.63809 112.14316 115.84911 10 b
excelCols(1e+05, LETTERS) 7.480882 8.377266 9.562554 8.93254 11.10519 14.11631 10 a
As @DirkEddelbuettel says, "Rcpp is not some magic pony...". These discrepancies in efficiency just show that although Rcpp, or any package for that matter, is super awesome, it won't fix crappy code. Thanks @eddi for posting a proper Rcpp implementation.
Here's a fast Rcpp solution which will be orders of magnitude faster than native R solutions:
library(Rcpp)
cppFunction('CharacterVector excelCols(int n, CharacterVector x) {
    CharacterVector res(n);
    int sz = x.size();
    std::string base;        // current label, e.g. "AB"
    int baseN[100] = {0};    // being lazy about size here - you will never grow larger than this
    for (int i = 0; i < n; ++i) {
        bool incr = false;
        // increment like an odometer, starting from the last character
        for (int j = base.size() - 1; j >= 0 && !incr; --j) {
            if (baseN[j] == sz) {
                // position is at the last symbol: wrap to the first and carry
                baseN[j] = 1;
                base[j] = as<std::string>(x[0])[0];
            } else {
                baseN[j] += 1;
                base[j] = as<std::string>(x[baseN[j] - 1])[0];
                incr = true;
            }
        }
        if (!incr) {
            // every position wrapped: the label grows by one character
            baseN[base.size()] = 1;
            base += x[0];
        }
        res[i] = base;
    }
    return res;
}')
excelCols(100, LETTERS)
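To make the carry logic easier to follow, here is a plain-R transliteration of the same odometer-style loop. It is only an illustrative sketch (`excel_cols_r` is a hypothetical name) and should be much slower than the compiled version:

```r
# Hypothetical pure-R transliteration of the odometer loop, for illustration only
excel_cols_r <- function(n, x = LETTERS) {
  res <- character(n)
  digits <- integer(0)               # 1-based indices into x, most significant first
  for (i in seq_len(n)) {
    j <- length(digits)
    # roll trailing digits that sit on the last symbol back to the first
    while (j >= 1 && digits[j] == length(x)) {
      digits[j] <- 1L
      j <- j - 1L
    }
    if (j >= 1) {
      digits[j] <- digits[j] + 1L    # ordinary increment, the carry stops here
    } else {
      digits <- c(1L, digits)        # every digit rolled over: grow by one
    }
    res[i] <- paste(x[digits], collapse = "")
  }
  res
}

excel_cols_r(29)[26:29]
# "Z" "AA" "AB" "AC"
```

Each label is derived from the previous one, so the work per element is a single increment plus the occasional carry rather than a full base conversion.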