Convert string arrays to data frame in R

Question

Suppose I have a string array like this:

sa<-c("HLA:HLA00001 A*01:01:01:01 1098 bp",
      "HLA:HLA01244 A*01:01:02 546 bp",
      "HLA:HLA01971 A*01:01:03 895 bp")

What is the best way to convert it to a data frame like this:

  Seq          Type             Length
1 HLA:HLA00001 A*01:01:01:01    1098 bp
2 HLA:HLA01244 A*01:01:02       546 bp
3 HLA:HLA01971 A*01:01:03       895 bp

Konrad Rudolph · Accepted Answer

Using the ‹dplyr› and ‹tidyr› packages, this is trivial:

Put the data into a data_frame, then separate the columns:

data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE)

Source: local data frame [3 x 3]

           Seq          Type Length
         (chr)         (chr)  (int)
1 HLA:HLA00001 A*01:01:01:01   1098
2 HLA:HLA01244    A*01:01:02    546
3 HLA:HLA01971    A*01:01:03    895

This (intentionally) drops the unit from the last column, which is now redundant (as it will always be the same), and converts it to an integer. If you want to keep it, use extra = 'merge' instead.
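As a minimal sketch of the extra = 'merge' variant mentioned above: the example below wraps the vector in a plain data.frame (an assumption made here to keep the dependencies down to ‹tidyr›) and leaves out convert = TRUE, since the merged column contains text.

```r
library(tidyr)

sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# With extra = 'merge', the pieces beyond the third stay in the last column
# instead of being dropped, so the unit is kept.
kept <- separate(data.frame(sa = sa, stringsAsFactors = FALSE),
                 sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'merge')

kept$Length
# "1098 bp" "546 bp" "895 bp"
```

Length stays a character column here, because "1098 bp" cannot be converted to an integer.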

You can further split the Type column with another ‹tidyr› function, extract. It is quite similar to separate, but instead of specifying where to split, you specify which parts to match. This function allows you to provide a regular expression (a must-learn tool if you don’t know it already!) that specifies which parts of a text to match. These parts are in parentheses here:

'(A\\*\\d{2}:\\d{2}):(.*)'

This means: extract two groups. The first group contains the string “A*” followed by two digits, “:” and another two digits. The second group contains all the rest of the text after a separating “:”. The backslashes are doubled because the pattern is written as an R string literal. (I hope I’ve captured the specification of HLA alleles correctly; I’ve never worked with this type of data.)

Put together with the code from above:

data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
    extract(Type, c('Group', 'Allele'), regex = '(A\\*\\d{2}:\\d{2}):(.*)')

Source: local data frame [3 x 4]

           Seq   Group Allele Length
         (chr)   (chr)  (chr)  (int)
1 HLA:HLA00001 A*01:01  01:01   1098
2 HLA:HLA01244 A*01:01     02    546
3 HLA:HLA01971 A*01:01     03    895
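To check what the two capture groups match without going through ‹tidyr›, the same pattern can be tried in base R. This is only a sketch: the types vector and the groups matrix are names introduced here for illustration, and inside an R string literal each backslash of the regular expression has to be doubled.

```r
# The Type values from the question, typed out by hand for this sketch
types <- c("A*01:01:01:01", "A*01:01:02", "A*01:01:03")

# Same regular expression as in the answer, written as an R string literal
pattern <- "(A\\*\\d{2}:\\d{2}):(.*)"

# regexec() reports the full match plus one entry per capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(types, regexec(pattern, types))

# Stack the results into a matrix and drop the first column (the full match)
groups <- do.call(rbind, m)[, -1, drop = FALSE]
colnames(groups) <- c("Group", "Allele")

groups
# Group is "A*01:01" in every row; Allele is "01:01", "02", "03"
```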

Psidom · Answer

Use read.table. This requires some extra effort because the delimiter (a space) also occurs inside the Length column, which you want to keep together:

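# Read four whitespace-separated fields; the unit lands in its own column
df <- read.table(text = sa, col.names = c("Seq", "Type", "Length", "Unit"))
# Glue the unit back onto the length
df$Length <- paste(df$Length, df$Unit)
# Show the result without the now-redundant Unit column
df[, -4]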
#            Seq          Type  Length
# 1 HLA:HLA00001 A*01:01:01:01 1098 bp
# 2 HLA:HLA01244    A*01:01:02  546 bp
# 3 HLA:HLA01971    A*01:01:03  895 bp
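If you would rather avoid the split-then-paste step, one possible alternative in base R is to match the three fields with a regular expression, so that the space inside Length is never treated as a delimiter. This is a sketch, not part of the answer above; the pattern is an assumption based on the three sample strings (two space-free fields followed by the rest of the line).

```r
sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# Two fields without spaces, then everything that is left as the third field
m <- regmatches(sa, regexec("^(\\S+) (\\S+) (.*)$", sa))

# Drop the full match (first column) and keep the three captured fields
df <- as.data.frame(do.call(rbind, m)[, -1], stringsAsFactors = FALSE)
names(df) <- c("Seq", "Type", "Length")

df$Length
# "1098 bp" "546 bp" "895 bp"
```

All three columns are character here. Strings that do not fit the pattern would need separate handling.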

Tags: dataframe, r

David Z
