I have a big data frame, and I want strings to be aligned in columns based on suffixes (substrings), the source dataframe looks like this:
notst stands for other variable prefix to be ignored
# col1 col2 col3
# notst-s1 notst-s2 notst-x3
# notst-s1 notst-x3 notst-a5
# notst-s2 notst-a5
# notst-x3 notst-a5
The result, should be:
# col1 col2 col3 col4
# notst-s1 notst-s2 notst-x3
# notst-s1 notst-x3 notst-a5
# notst-s2 notst-a5
# notst-x3 notst-a5
Edit:
Consider the whole suffix (after "-"). It does not have numbers. There are cases in which the whole string ("xxxx-spst") should be matched (*) because the xxxx part of the string comes in several versions.
For:
df <- read.table(text="
col1 col2 col3
st1-ab stb-spst sta-spst
stc-spst sta-spst st4-ab
stb-spst st7-ab
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
a possible result, could be: (column name and order is irrelevant)
# col1 col2 col3 col4
# st1-ab stb-spst sta-spst
# st4-ab stc-spst sta-spst
# st7-ab stb-spst
# stb-spst st9-ba
(*) Note that in row 2, col2, "stc-spst" seems misplaced, but it is not a problem because the value stb-spst does not exist in that row, so for that particular case, only the suffix ("spst") matters. In other words, when the whole string (prefix-suffix) matches others (in other rows), they should be in the same column, if not, when the suffix matches the suffix (of other rows), they should be in the same column. The resulting dataframe should have the same number of rows as the original and the lowest number of columns possible.
EDIT. answer should be universal and work for:
df2 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stc-spst sta-spst st4-ab st2-ab
stb-spst st7-ab sa-ac
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
for example, also. Possible result:
# col1 col2 col3 col4 col5 col6 col7
# st1-ab stb-spst sta-spst std-spst
# st4-ab stc-spst sta-spst st2-ab
# st7-ab stb-spst sa-ac
# stb-spst st9-ba
example 3
df3 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst sta-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
desired output
col1 col2 col3 col4 col5
1 st1-ab sta-spst stb-spst std-spst
2 sta-ab stb-spst
3 sa-ac st7-ab sta-spst
4 sta-spst stb-spst
EDIT example 4. In order to make the task easier, you can explicitly define in a function the suffixes that may have more than one possible prefix per row. In this example ("spst"). So any string with suffix different to "spst" should have only one possible prefix per row and can and must be collapsed into one column in the resulting df, as the col2 in the desired output. This is not what I wanted originally because I will get more columns than expected. Ideally strings containing spst and different prefixes should appear in the lowest numbers of columns possible. See (* above).
df4 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst st1-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst st7-ab",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
desired output
row_id col1 col2 col3 col4 col5
1 st1-ab sta-spst stb-spst std-spst
2 st1-ab stb-spst
3 sa-ac st7-ab sta-spst
4 st7-ab sta-spst stb-spst
We can do this by first melt
ing the dataset, extract the numeric index from the elements, create a row/column index based on that and assign the elements to a matrix
created based on the max value of the index.
library(reshape2)
d1 <- na.omit(transform(melt(as.matrix(df1)), v1 = as.numeric(sub("\\D+", "", value))))
m1 <- matrix("", nrow = max(d1$Var1), ncol = max(d1$v1))
m1[as.matrix(d1[c("Var1", "v1")])] <- as.character(d1$value)
d2 <- as.data.frame(m1[,!!colSums(m1!="")])
colnames(d2) <- paste0("col", seq_along(d2))
d2
# col1 col2 col3 col4
#1 notst-s1 notst-s2 notst-x3
#2 notst-s1 notst-x3 notst-a5
#3 notst-s2 notst-a5
#4 notst-x3 notst-a5
Matrix indexing might make this a possibility:
sel <- dat!=""
unq <- unique(dat[sel])
mat <- matrix(NA, nrow=nrow(dat), ncol=length(unq))
mat[cbind(row(dat)[sel], match(dat[sel], unq) )] <- dat[sel]
# [,1] [,2] [,3] [,4]
#[1,] "notst-s1" "notst-s2" "notst-x3" NA
#[2,] "notst-s1" NA "notst-x3" "notst-a5"
#[3,] NA "notst-s2" NA "notst-a5"
#[4,] NA NA "notst-x3" "notst-a5"
Where dat
was imported as:
dat <- read.table(text="
col1 col2 col3
notst-s1 notst-s2 notst-x3
notst-s1 notst-x3 notst-a5
notst-s2 notst-a5
notst-x3 notst-a5",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
Tested with four examples, but this version was done without regard for the information you added as a workaround in example 4.
The main addition is shuffle logic (which may be quite slow) to compact the resulting dataframe form right to left. It's possible that the assigned_by_suffix
and the assigned_by_single_suffix
are no longer required, but I have not verified.
Outputs are at the end of the code
# examples
df1 <- read.table(text="
col1 col2 col3
st1-ab stb-spst sta-spst
stc-spst sta-spst st4-ab
stb-spst st7-ab
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
df2 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stc-spst sta-spst st4-ab st2-ab
stb-spst st7-ab sa-ac
st9-ba stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
df3 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst sta-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
df4 <- read.table(text="
col1 col2 col3 col4
st1-ab stb-spst sta-spst std-spst
stb-spst st1-ab
sta-spst st7-ab sa-ac
sta-spst stb-spst st7-ab",header=TRUE,fill=TRUE,stringsAsFactors=FALSE)
library(reshape2)
library(tidyr)
library(dplyr)
library(stringr)
library(assertthat)
suffix <- function(s) {str_extract(s, "[^\\-]+$")}
# make a tall dataframe with melt, and get the suffix
dfm <- df4 %>%
mutate(row_id = seq_along(col1)) %>%
melt(id.vars="row_id") %>%
select(-2) %>%
filter(value != "") %>%
mutate(suffix = suffix(value)) %>%
arrange(value)
assert_that(!any(duplicated(dfm[c("row_id", "value")])))
# initialize
combined <- data.frame()
remaining <- dfm
# get the groups with more than 1 value
matched_values <- dfm %>%
group_by(value, suffix) %>%
summarize(n=n()) %>%
filter(n>1) %>%
rename(group_id = value) %>%
ungroup()
# .. and assign the group ids that match
assigned_by_value <- remaining %>%
inner_join(matched_values %>% select(group_id), by = c("value" = "group_id")) %>%
mutate(group_id = value) %>%
select(row_id, value, suffix, group_id)
combined <- combined %>% bind_rows(assigned_by_value)
remaining <- dfm %>% anti_join(combined, by=c("row_id", "value"))
# find the remaining suffixes
matched_suffixes <- remaining %>%
group_by(suffix) %>%
summarize(n=n()) %>%
filter(n>1) %>%
select(-n) %>%
ungroup()
# ... and assign those that match
assigned_by_suffix <- remaining %>%
inner_join(matched_suffixes, by="suffix") %>%
mutate(group_id = suffix)
combined <- bind_rows(combined, assigned_by_suffix)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
# All that remain are singles assign matches by suffix, choosing the match with fewest
assigned_by_single_suffix <- remaining %>%
inner_join(matched_values, by = "suffix") %>%
top_n(1, n) %>%
head(1) %>%
select(-n)
combined <- bind_rows(combined, assigned_by_single_suffix)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
# get the remaining unmatched
unmatched <- remaining%>%
mutate(group_id = value)
combined <- bind_rows(combined, unmatched)
remaining <- remaining %>% anti_join(combined, by=c("row_id", "value"))
assert_that(nrow(remaining) == 0)
# any overloads (duplicates) need to bump to their own column
dups <- duplicated(combined[,c("row_id", "group_id")])
combined$group_id[dups] <- combined$value[dups]
assert_that(nrow(combined) == nrow(dfm))
# spread the result
result <- spread(combined %>% select(-suffix), group_id, value, fill ="")
# Shuffle any matching suffix from right to left, so l long as there
# is corresponding space an that the whole column can move
# i is source (startign from right) - j is target (starting from right)
#
drop_cols = c()
suffixes <- suffix(names(result))
for (i in (ncol(result)):3) {
for(j in (i-1):2) {
if (suffixes[i] == suffixes[j]) {
non_empty <- which(result[,i] != "") # list of source to move
can_fill <- which(result[,j] == "") # list of targets can be filled
can_move <- all(non_empty %in% can_fill) # is to move a subset of can_fill?
# if there's space, shuffle the column down
if (can_move ) {
# shuffle down
result[,j] <- if_else(result[,j] != "", result[,j], result[,i])
drop_cols <- c(drop_cols, i)
result[,i] <- NA
break
}
}
}
}
if (!is.null(drop_cols)) {
result <- result[,-drop_cols]
}
result
# Example 1
# row_id ab st9-ba sta-spst stb-spst
# 1 1 st1-ab sta-spst stb-spst
# 2 2 st4-ab sta-spst stc-spst
# 3 3 st7-ab stb-spst
# 4 4 st9-ba stb-spst
# Example 2
# row_id ab sa-ac spst st2-ab st9-ba sta-spst stb-spst
# 1 1 st1-ab std-spst sta-spst stb-spst
# 2 2 st4-ab stc-spst st2-ab sta-spst
# 3 3 st7-ab sa-ac stb-spst
# 4 4 st9-ba stb-spst
# Example 3
# row_id ab sa-ac sta-spst stb-spst std-spst
# 1 1 st1-ab sta-spst stb-spst std-spst
# 2 2 sta-ab stb-spst
# 3 3 st7-ab sa-ac sta-spst
# 4 4 sta-spst stb-spst
# Example 4
# row_id sa-ac st1-ab sta-spst stb-spst std-spst
# 1 1 st1-ab sta-spst stb-spst std-spst
# 2 2 st1-ab stb-spst
# 3 3 sa-ac st7-ab sta-spst
# 4 4 st7-ab sta-spst stb-spst
>
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With