I have a simple vector of integers in R. I would like to randomly select n positions in the vector and "merge" them (i.e. sum them) in the vector. This process could happen multiple times, e.g. in a vector of length 100, 5 merging/summing events could occur, with 2, 3, 2, 4, and 2 vector positions being merged in each event, respectively. For instance:
#An example original vector of length 10:
ex.have<-c(1,1,30,16,2,2,2,1,1,9)
#For simplicity assume some process randomly combines the
#first two [1,1] and last three [1,1,9] positions in the vector.
ex.want<-c(2,30,16,2,2,2,11)
#Here, there were two merging events of 2 and 3 vector positions, respectively
#EDIT: the merged positions do not need to be consecutive.
#They could be randomly selected from any position.
In addition, I need to record how many vector positions were "merged" into each value (with the value 1 for any position that was not merged); I will call these the indices. Since the first two were merged and the last three were merged in the example above, the indices data would look like:
ex.indices<-c(2,1,1,1,1,1,3)
Finally, I need to put it all in a matrix, so the final data in the example above would be a 2-column matrix with the integers in one column and the indices in another:
ex.final<-matrix(c(2,30,16,2,2,2,11,2,1,1,1,1,1,3),ncol=2,nrow=7)
At the moment I am seeking assistance even on the simplest step: combining positions in the vector. I have tried multiple variations on the sample and split functions, but am hitting a dead end. For instance, sum(sample(ex.have,2)) will sum two randomly selected positions (or sum(sample(ex.have,rpois(1,2))) will add some randomness in the n values), but I am unsure how to leverage this to achieve the desired dataset. An exhaustive search has led to multiple articles on combining vectors, but not positions in vectors, so I apologize if this is a duplicate. Any advice on how to approach any of this would be much appreciated.
Here is a function I designed to perform the task you described.
The vec_merge function takes the following arguments:
x: an integer vector.
event_perc: the proportion of events. This is a number greater than 0 and at most 1 (although 1 is probably too large). The number of events is calculated as the length of x multiplied by event_perc, rounded to the nearest integer.
sample_n: the merge sample sizes. This is an integer vector in which every number is at least 2.
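As a quick illustration of the event_perc arithmetic described above, here is a minimal sketch. The names n_events and worst_case are only for illustration and are not part of the function:

```r
x <- 1:10
event_perc <- 0.2
sample_n <- c(2, 3)

# Number of merge events: length of x times event_perc, rounded
n_events <- round(length(x) * event_perc)

# Most positions the events could use if every event drew the largest size
worst_case <- n_events * max(sample_n)

n_events                  # 2
worst_case <= length(x)   # TRUE, so these settings always fit
```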
vec_merge <- function(x, event_perc = 0.2, sample_n = c(2, 3)){
  # Check if event_perc makes sense
  if (event_perc > 1 || event_perc <= 0){
    stop("event_perc should be greater than 0 and at most 1.")
  }
  # Check if sample_n makes sense
  if (any(sample_n < 2)){
    stop("sample_n should be at least 2.")
  }
  # Determine the number of events
  n <- round(length(x) * event_perc)
  # Determine the sample size of each event; indexing with sample.int()
  # avoids sample() treating a length-one sample_n such as 3 as 1:3
  sample_vec <- sample_n[sample.int(length(sample_n), size = n, replace = TRUE)]
  names(sample_vec) <- paste0("S", seq_len(n))
  # Check if the sum of sample_vec is larger than the length of x
  # If yes, stop the function and print a message
  if (length(x) < sum(sample_vec)){
    stop("Too many samples. Decrease event_perc or sample_n.")
  }
  # Determine how many values will not be merged
  n2 <- length(x) - sum(sample_vec)
  # Create a vector of n2 ones; seq_len() also handles n2 = 0
  non_merge_vec <- rep(1, n2)
  names(non_merge_vec) <- paste0("N", seq_len(n2))
  # Combine sample_vec and non_merge_vec, then shuffle the combined vector
  combine_vec <- c(sample_vec, non_merge_vec)
  combine_vec2 <- combine_vec[sample.int(length(combine_vec))]
  # Expand the run lengths into one group label per element of x
  expand_list <- list(lengths = combine_vec2, values = names(combine_vec2))
  expand_vec <- inverse.rle(expand_list)
  # Create a data frame with x and expand_vec
  dat <- data.frame(number = x,
                    group = factor(expand_vec, levels = unique(expand_vec)))
  dat$index <- 1
  # Sum the values and count the merged positions within each group
  dat2 <- aggregate(cbind(number = dat$number, index = dat$index),
                    by = list(group = dat$group),
                    FUN = sum)
  # Convert dat2 to a matrix, remove the group column
  dat2$group <- NULL
  mat <- as.matrix(dat2)
  return(mat)
}
Here is a test of the function. I applied it to the sequence from 1 to 10. As you can see, in this example 4 and 5 are merged, and 8 and 9 are also merged.
set.seed(123)
vec_merge(1:10)
#      number index
# [1,]      1     1
# [2,]      2     1
# [3,]      3     1
# [4,]      9     2
# [5,]      6     1
# [6,]      7     1
# [7,]     17     2
# [8,]     10     1
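The step that turns the shuffled run lengths into one group label per element is inverse.rle(). Below is a small deterministic sketch of that step on its own, with hand-picked run lengths instead of random ones. The names lens, grp and sums are illustrative only:

```r
x <- 1:8
# Run lengths in vector order: N* are unmerged positions, S* are merge events
lens <- c(N1 = 1, N2 = 1, S1 = 2, N3 = 1, S2 = 3)

# Repeat each label by its run length, giving one label per element of x
grp <- inverse.rle(list(lengths = lens, values = names(lens)))
grp
# [1] "N1" "N2" "S1" "S1" "N3" "S2" "S2" "S2"

# Sum within each label, keeping the labels in order of appearance
sums <- tapply(x, factor(grp, levels = unique(grp)), sum)
sums
# N1 N2 S1 N3 S2
#  1  2  7  5 21
```

Because each label covers a contiguous run, this approach only ever merges neighbouring positions; merging arbitrary positions would need the labels themselves to be shuffled per element.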
I suppose you could write a function like the following:
fun <- function(vec = have, events = merge_events, include_orig = TRUE) {
  if (sum(events) > length(vec)) stop("Too many events to merge")
  # Create "groups" for the events
  merge_events_seq <- rep(seq_along(events), events)
  # Create "groups" for the rest of the data
  remainder <- sequence((length(vec) - sum(events))) + length(events)
  # Combine both groups and shuffle them so that the
  # positions being combined are not necessarily consecutive
  inds <- sample(c(merge_events_seq, remainder))
  # Aggregate using `data.table`
  temp <- data.table(values = vec, groups = inds)[
    , list(count = length(values),
           total = sum(values),
           pos = toString(.I),
           original = toString(values)), groups][, groups := NULL]
  # Drop the other columns if required. Return the output.
  if (isTRUE(include_orig)) temp[] else temp[, c("original", "pos") := NULL][]
}
The function returns four columns:
count: the number of values that were merged into each sum (your ex.indices).
total: the sum of the merged values (your ex.want).
pos: the positions of the original values from the input vector.
original: the original values themselves.
The last two columns can be dropped from the result by setting include_orig = FALSE. The function will also produce an error if the number of elements you're trying to merge exceeds the length of the input (ex.have) vector.
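To see what the grouping step inside the function does, here is the label construction on its own for a vector of length 10 and events of sizes 2 and 3. This is only a sketch; n, merge_labels, single_labels, labels and shuffled are illustrative names:

```r
n <- 10              # length of the input vector
events <- c(2, 3)    # sizes of the merge events

# Labels 1 and 2 mark the two events, repeated once per merged position
merge_labels <- rep(seq_along(events), events)
# Every remaining position gets its own label, continuing after the events
single_labels <- sequence(n - sum(events)) + length(events)

labels <- c(merge_labels, single_labels)
labels
# [1] 1 1 2 2 2 3 4 5 6 7

# Shuffling assigns the labels to random positions,
# so merged positions need not be adjacent
set.seed(42)
shuffled <- sample(labels)
table(shuffled)  # group sizes are unchanged by the shuffle
```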
Here's some sample data:
library(data.table)
set.seed(1) ## So you can recreate these examples with the same results
have <- sample(20, 10, TRUE)
have
## [1] 4 7 1 2 11 14 18 19 1 10
merge_events <- c(2, 3)
fun(have, merge_events)
##    count total      pos   original
## 1:     1     4        1          4
## 2:     1     7        2          7
## 3:     2     2     3, 9       1, 1
## 4:     1     2        4          2
## 5:     3    40 5, 8, 10 11, 19, 10
## 6:     1    14        6         14
## 7:     1    18        7         18
fun(events = c(3, 4))
##    count total        pos     original
## 1:     4    39 1, 4, 6, 8 4, 2, 14, 19
## 2:     3    36    2, 5, 7    7, 11, 18
## 3:     1     1          3            1
## 4:     1     1          9            1
## 5:     1    10         10           10
fun(events = c(6, 4, 3))
## Error: Too many events to merge
input <- sample(30, 20, TRUE)
input
## [1] 6 10 10 6 15 20 28 20 26 12 25 23 6 25 8 12 25 23 24 6
fun(input, events = c(4, 7, 2, 3))
##    count total                    pos                original
## 1:     7    92 1, 3, 4, 5, 11, 19, 20 6, 10, 6, 15, 25, 24, 6
## 2:     1    10                      2                      10
## 3:     3    71               6, 9, 14              20, 26, 25
## 4:     4    69          7, 12, 13, 16           28, 23, 6, 12
## 5:     2    45                  8, 17                  20, 25
## 6:     1    12                     10                      12
## 7:     1     8                     15                       8
## 8:     1    23                     18                      23
# Verification
input[c(1, 3, 4, 5, 11, 19, 20)]
## [1] 6 10 6 15 25 24 6
sum(.Last.value)
## [1] 92
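Since the question asked for a two-column matrix, one possible way to get there from a vector of group labels in base R is rowsum(). This is a sketch rather than part of the function above: merge_to_matrix and grp are hypothetical names, and the labels below are written by hand to reproduce the question's example.

```r
merge_to_matrix <- function(x, grp) {
  # Sum the values within each group, in order of first appearance
  totals <- rowsum(x, grp, reorder = FALSE)
  # Count the positions in each group by summing a vector of ones
  counts <- rowsum(rep(1, length(x)), grp, reorder = FALSE)
  # Drop the row names so the result is a plain two-column matrix
  unname(cbind(totals, counts))
}

ex.have <- c(1, 1, 30, 16, 2, 2, 2, 1, 1, 9)
# First two positions share label 1, last three share label 7
grp <- c(1, 1, 2, 3, 4, 5, 6, 7, 7, 7)
res <- merge_to_matrix(ex.have, grp)
res
```

Passing sample(grp) instead of grp would merge randomly chosen, not necessarily adjacent, positions; the column totals (the sum of the values and the number of positions) stay the same either way.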