I have this dataframe, it looks like this:
I need to take the first character from column at, the whole value in an, and then append a counter that increments for repeats in column an. The counter must always be three characters long, padded with zeros. The end result is this:
So nothing here that dramatic, I was able to do this with the following code (prepare to be impressed):
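library(stringr)

tk <- ""       # value of `an` seen on the previous row
counter <- 0   # position within the current run of equal `an` values
for (i in seq_len(nrow(df))) {
  if (tk == df$an[i]) {
    counter <- counter + 1
  } else {
    tk <- df$an[i]
    counter <- 1
  }
  df$ap[i] <- counter
}
# first letter of `at` + `an` + zero-padded counter
df$ap <- paste0(substr(df$at, 1, 1), df$an, str_pad(df$ap, 3, pad = "0"))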
I'm so not satisfied with this debacle. It seems not very "R" and I'd like very much never to allow this to see the light of day. How can I make this more "R"?
I appreciate the advice.
library(stringr)
library(dplyr)

df1 <- df %>%
  group_by(an) %>%
  mutate(ap = paste0(substr(at, 1, 1), an, str_pad(row_number(), 3, pad = "0")))
    at     an         ap
1  NDA 023356 N023356001
2 ANDA 023357 A023357001
3 ANDA 023357 A023357002
4  NDA 023357 N023357003
5 ANDA 023398 A023398001
The rleid and rowid functions from data.table can be useful here:
# using df from @Florian's answer
library(data.table)
setDT(df)
df[, v := paste0(
  substr(at, 1, 1),
  an,
  sprintf("%03.f", rowid(rleid(an)))
)]
#      at     an          v
# 1:  NDA 023356 N023356001
# 2: ANDA 023357 A023357001
# 3: ANDA 023357 A023357002
# 4:  NDA 023357 N023357003
# 5: ANDA 023398 A023398001
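To make the two helpers concrete, here is a small sketch that prints the intermediate vectors. It uses the an values from the stringr answer further down the page, which is an assumption about what df holds; the names runs and counter are only for illustration.

```r
library(data.table)

# Assumed sample values for the `an` column
an <- c("023356", "023357", "023357", "023357", "023398")

runs <- rleid(an)       # run id: increases each time the value changes
counter <- rowid(runs)  # 1, 2, 3, ... within each run id

runs
# [1] 1 2 2 2 3
counter
# [1] 1 1 2 3 1
sprintf("%03d", counter)
# [1] "001" "001" "002" "003" "001"
```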
How it works:
- sprintf from base effectively does the job of stringr::str_pad in the OP.
- rleid groups runs of repeating values together.
- rowid makes a counter within each group.

In base R, you can use sprintf to pad 0s and ave to get the counts like this:
df$ap <- paste0(substr(df$at, 1, 1), df$an,
                sprintf("%03.0f", as.numeric(ave(df$an, df$an, FUN = seq_along))))
ave performs the group calculations and seq_along counts the rows within each group.
This returns
df
    at     an         ap
1  NDA 023356 N023356001
2 ANDA 023357 A023357001
3 ANDA 023357 A023357002
4  NDA 023357 N023357003
5 ANDA 023398 A023398001
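One thing to watch: ave(df$an, df$an, ...) numbers rows per distinct value, whereas the loop in the question and the rleid approach number rows per consecutive run. The two agree on the sample data but differ if a value comes back after a gap. Below is a base R sketch of the run-based counter, using a made-up vector x that contains such a gap:

```r
# Hypothetical input where "A" reappears after a different value
x <- c("A", "A", "B", "A")

# Per distinct value: the last "A" continues the earlier count.
# ave() returns character here because x is character, hence as.integer().
by_value <- as.integer(ave(x, x, FUN = seq_along))
by_value
# [1] 1 2 1 3

# Per consecutive run: rle() gives run lengths, sequence() counts within each
by_run <- sequence(rle(x)$lengths)
by_run
# [1] 1 2 1 1

sprintf("%03d", by_run)
# [1] "001" "002" "001" "001"
```

Which of the two is right depends on whether a repeated an that is not adjacent to its earlier rows should continue the old count or start over.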
This works:
library(stringr)
df <- data.frame(at = c("NDA", "ANDA", "ANDA", "NDA", "ANDA"),
                 an = c("023356", "023357", "023357", "023357", "023398"),
                 stringsAsFactors = FALSE)
df$ap <- paste0(substr(df$at, 1, 1), df$an,
                str_pad(ave(df$an, df$an, FUN = seq_along), width = 3, pad = "0"))
Output:
    at     an         ap
1  NDA 023356 N023356001
2 ANDA 023357 A023357001
3 ANDA 023357 A023357002
4  NDA 023357 N023357003
5 ANDA 023398 A023398001
Hope this helps!
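For the padding step on its own, base R has a couple of options that behave like str_pad(x, 3, pad = "0") for small positive integers. A short sketch follows; the vector n is made up for illustration. Note that neither call truncates, so a counter above 999 simply comes out four digits wide.

```r
n <- c(1L, 12L, 123L, 1000L)  # hypothetical counters

padded_sprintf <- sprintf("%03d", n)
padded_formatC <- formatC(n, width = 3, flag = "0")

# Both give "001", "012", "123", "1000"
identical(padded_sprintf, padded_formatC)
# [1] TRUE
```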