I have a large dataset where all column headers are individual IDs, each 8 characters in length. I would like to split those individual IDs into 2 rows, where the first row of IDs contains the first 7 characters, and the second row contains just the last character.
Current dataset:
ID1: Indiv01A Indiv01B Indiv02A Indiv02B Speci03A Speci03B
Intended dataset:
ID1: Indiv01 Indiv01 Indiv02 Indiv02 Speci03 Speci03
ID2: A B A B A B
I've looked through other posts on splitting data, but they all seem to have a unique delimiter that separates the two parts of the column name (i.e., a comma or a period between the 2 components).
This is the code I'm thinking would work best, but I just can't figure out how to code for "7 characters" as the split point, rather than a comma:
sapply(strsplit(as.character(d$ID), ",")
Any help would be appreciated.
Here's a regular expression solution with strsplit. It splits the string between the 7th and the 8th character:
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)
# [[1]]
# [1] "Indiv01" "A"
#
# [[2]]
# [1] "Indiv01" "B"
#
# [[3]]
# [1] "Indiv02" "A"
#
# [[4]]
# [1] "Indiv02" "B"
#
# [[5]]
# [1] "Speci03" "A"
#
# [[6]]
# [1] "Speci03" "B"
Now, you can use rbind to create two columns:
do.call(rbind, res)
# [,1] [,2]
# [1,] "Indiv01" "A"
# [2,] "Indiv01" "B"
# [3,] "Indiv02" "A"
# [4,] "Indiv02" "B"
# [5,] "Speci03" "A"
# [6,] "Speci03" "B"
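Since the question asks for two rows rather than two columns, the matrix can be transposed. This is a minimal sketch, assuming the IDs are held in a character vector as above; the object name out and the row labels ID1 and ID2 are illustrative choices taken from the question's intended layout, not part of the original answer:

```r
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")

# Split each ID after its 7th character (same call as above)
res <- strsplit(ID1, "(?<=.{7})", perl = TRUE)

# Bind the pieces into a 6 x 2 matrix, then transpose to 2 x 6
# so that each original ID occupies one column
out <- t(do.call(rbind, res))

# Label the rows as in the question's intended layout (illustrative names)
rownames(out) <- c("ID1", "ID2")
out
```

If the IDs are the column names of a data frame d, names(d) could be passed to strsplit in place of ID1.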
Explanation of the regex pattern:
(?<=.{7})
The (?<=...) construct is a (positive) lookbehind. It matches any position that is preceded by the specified pattern. Here, the pattern is .{7}: the dot (.) matches any character, and {7} means seven times. Hence, the regex matches a position that has seven characters immediately before it, and strsplit splits each 8-character ID between its 7th and 8th characters.
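Because the split point is a fixed position, a regex is not strictly required. As an alternative (not the method used above), here is a rough sketch with base R's substr, assuming every ID is exactly 8 characters long as stated in the question; the names prefix, suffix and two_rows are illustrative:

```r
ID1 <- c("Indiv01A", "Indiv01B", "Indiv02A", "Indiv02B", "Speci03A", "Speci03B")

# Characters 1 to 7 form the first part of each ID
prefix <- substr(ID1, 1, 7)

# Character 8 is the trailing letter
suffix <- substr(ID1, 8, 8)

# Stack the two parts as rows; the argument names become the row names
two_rows <- rbind(ID1 = prefix, ID2 = suffix)
two_rows
```

This avoids the lookbehind entirely, but it silently returns wrong pieces if an ID is not 8 characters long, so checking all(nchar(ID1) == 8) first may be worthwhile.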