Suppose I have a character vector such as:
sa<-c("HLA:HLA00001 A*01:01:01:01 1098 bp",
"HLA:HLA01244 A*01:01:02 546 bp",
"HLA:HLA01971 A*01:01:03 895 bp")
What is the best way to convert it to a data frame like this?
           Seq          Type  Length
1 HLA:HLA00001 A*01:01:01:01 1098 bp
2 HLA:HLA01244    A*01:01:02  546 bp
3 HLA:HLA01971    A*01:01:03  895 bp
Using the ‹dplyr› and ‹tidyr› packages, this is trivial:
library(dplyr)
library(tidyr)
# data_frame() is deprecated in current tibble releases; tibble(sa) is the replacement
data_frame(sa) %>%
  separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE)
Source: local data frame [3 x 3]
           Seq          Type Length
         (chr)         (chr)  (int)
1 HLA:HLA00001 A*01:01:01:01   1098
2 HLA:HLA01244    A*01:01:02    546
3 HLA:HLA01971    A*01:01:03    895
This (intentionally) drops the unit from the last column, which is now redundant (as it will always be the same), and converts it to an integer. If you want to keep it, use extra = 'merge' instead.
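As a minimal sketch of the extra = 'merge' variant, assuming the same sa vector as in the question (a plain data.frame is used here only to keep the example short, and res is just an illustrative name):

```r
library(tidyr)

sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# extra = 'merge' splits at most twice (once per gap between the three
# target columns), so the remaining "1098 bp" stays together in Length
res <- separate(data.frame(sa, stringsAsFactors = FALSE), sa,
                c('Seq', 'Type', 'Length'), sep = ' ', extra = 'merge')
res$Length
```

Note that Length is then a character column, since the unit is part of the value.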
You can further separate the Type column by applying another ‹tidyr› function, extract. It is quite similar to separate, but instead of a delimiter you specify which parts to match. You do this by providing a regular expression (a must-learn tool if you don’t know it already!) in which the parts to extract are enclosed in parentheses:
'(A\\*\\d{2}:\\d{2}):(.*)'
This means: extract two groups. The first group contains the string “A*” followed by two digits, “:” and another two digits. The second group contains all the rest of the text after a separating “:” (I hope I’ve captured the specification of HLA alleles correctly; I’ve never worked with this type of data).
Put together with the code from above:
data_frame(sa) %>%
  separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
  extract(Type, c('Group', 'Allele'), regex = '(A\\*\\d{2}:\\d{2}):(.*)')
Source: local data frame [3 x 4]
           Seq   Group Allele Length
         (chr)   (chr)  (chr)  (int)
1 HLA:HLA00001 A*01:01  01:01   1098
2 HLA:HLA01244 A*01:01     02    546
3 HLA:HLA01971 A*01:01     03    895
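If you want to check what the two capture groups match without going through tidyr, here is a rough base R sketch using regexec and regmatches. The types vector and the names pattern, m and parts are only illustrative; this is not how extract is implemented, just a way to inspect the same regular expression:

```r
# The Type values from the example above
types <- c("A*01:01:01:01", "A*01:01:02", "A*01:01:03")

pattern <- '(A\\*\\d{2}:\\d{2}):(.*)'

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings
m <- regmatches(types, regexec(pattern, types))

# One row per input: full match, first group, second group
parts <- do.call(rbind, m)
parts[, 2]  # first group, the "A*dd:dd" prefix
parts[, 3]  # second group, everything after the separating ":"
```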
You can also use read.table, although it takes some extra effort because the delimiter (a space) also occurs inside the column that you want to keep together:
df <- read.table(text = sa, col.names = c("Seq", "Type", "Length", "Unit"))
df$Length <- paste(df$Length, df$Unit)
df[,-4]
#            Seq          Type  Length
# 1 HLA:HLA00001 A*01:01:01:01 1098 bp
# 2 HLA:HLA01244    A*01:01:02  546 bp
# 3 HLA:HLA01971    A*01:01:03  895 bp
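If a numeric length is more useful than the "1098 bp" string, a possible variation is to skip the paste step and drop the unit column, after checking that the unit really is the same everywhere. This is only a sketch on the sample data; df_num is an illustrative name:

```r
sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

df <- read.table(text = sa, col.names = c("Seq", "Type", "Length", "Unit"),
                 stringsAsFactors = FALSE)

# Only drop the unit if it carries no information (assumed to be "bp" throughout)
stopifnot(all(df$Unit == "bp"))

# read.table has already parsed Length as an integer column
df_num <- df[, c("Seq", "Type", "Length")]
str(df_num)
```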