Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to split character vectors into new rows in a dataframe

Tags:

split

r

I wasn't sure how to word this correctly while searching, so sorry if this has a simple answer.

I have 58 dataframes with ~25,000 rows each that I am getting from .csv's. They look something like this:

Probe.Id     Gene.Id             Score.d
1418126_at   6352                28.52578
145119_a_at  2192                24.87866
1423477_at   NA                  24.43532
1434193_at   100506144///9204    6.22395

Ideally I want to split the Ids at the "///"s and get them on new rows. Like so:

Probe.Id     Gene.Id             Score.d
1418126_at   6352                28.52578
145119_a_at  2192                24.87866
1423477_at   NA                  24.43532
1434193_at   100506144           6.22395
1434193_at   9204                6.22395

Using strsplit allows me to get Gene.Id as a list of character vectors, but once I have that I'm not sure what the most effective way is to get each of the individual ids on their own row with the correct values from the other columns. Ideally I don't want to just be looping through 25,000 rows.

If anyone knows the right way to do this I'd super appreciate it.

EDIT: I should've added that there's a complicating factor in that there are rows which have ids like so:

333932///126961///653604///8350///8354///8355///8356///8968///8352///8358///835‌​1///8353///8357" 

and I have no idea what the maximum number of ids in a row is.

like image 454
AAlleavitch Avatar asked Jan 29 '13 00:01

AAlleavitch


3 Answers

Edit: New solution after OP's comment. Very straightforward using data.table:

df <- structure(list(Probe.Id = c("1418126_at", "145119_a_at", "1423477_at", 
        "1434193_at", "100_at"), Gene.Id = c("6352", "2192", NA, 
        "100506144///9204", "100506144///100506146///100506148///100506150"), 
         Score.d = c(28.52578, 24.87866, 24.43532, 6.22395, 6.22395)), 
        .Names = c("Probe.Id", "Gene.Id", "Score.d"), row.names = c(NA, 5L), 
        class = "data.frame")

require(data.table)
dt <- data.table(df)
dt.out <- dt[, list(Probe.Id = Probe.Id, 
          Gene.Id = unlist(strsplit(Gene.Id, "///")), 
          Score.d = Score.d), by=1:nrow(dt)]

> dt.out

#    nrow    Probe.Id   Gene.Id  Score.d
# 1:    1  1418126_at      6352 28.52578
# 2:    2 145119_a_at      2192 24.87866
# 3:    3  1423477_at        NA 24.43532
# 4:    4  1434193_at 100506144  6.22395
# 5:    4  1434193_at      9204  6.22395
# 6:    5      100_at 100506144  6.22395
# 7:    5      100_at 100506146  6.22395
# 8:    5      100_at 100506148  6.22395
# 9:    5      100_at 100506150  6.22395

You could add fixed = TRUE to the strsplit expression to speedup further, if /// is a fixed pattern.

Alternative Again using data.table. Taking into consideration that strsplit is a vectorised operation and that running it on the whole of Gene.Id column would be much faster than running it thro' 1 row at a time (even though data.table runs thro' very fast, you could get more speedup by splitting the previous code into 2 steps:

# first split using strsplit (data.table can hold list in its columns!!)
dt[, Gene.Id_split := strsplit(dt$Gene.Id, "///", fixed=TRUE)]
# then just unlist them
dt.2 <- dt[, list(Probe.Id = Probe.Id, 
                  Gene.Id = unlist(Gene.Id_split), 
                  Score.d = Score.d), by = 1:nrow(dt)]

I just replicated the data.table shown in this example many times until I got 295245 rows. And then I ran a benchmark using rbenchmark:

# first function
DT1 <- function() {
    dt.1 <- dt[, list(Probe.Id = Probe.Id, 
             Gene.Id = unlist(strsplit(Gene.Id, "///", fixed = TRUE)), 
             Score.d = Score.d), by=1:nrow(dt)]
}

# expected to be faster function
DT2 <- function() {
    dt[, Gene.Id_split := strsplit(dt$Gene.Id, "///", fixed=TRUE)]
    # then just unlist them
    dt.2 <- dt[, list(Probe.Id = Probe.Id, Gene.Id = unlist(Gene.Id_split), Score.d = Score.d), by = 1:nrow(dt)]
}

require(rbenchmark)
benchmark(DT1(), DT2(), replications=10, order="elapsed")

#    test replications elapsed relative user.self sys.self
# 2 DT2()           10  15.708    1.000    14.390    0.391
# 1 DT1()           10  24.957    1.589    23.723    0.436

For this example, you get about 1.6 times faster. But this depends on the number of entries with ///. Hope this helps.

OLD solution: (for continuity)

One way is to: 1) find the positions where this /// occurs, 2) extract, 3) duplicate, 4) sub and 5) combine them.

df <- structure(list(Probe.Id = structure(c(1L, 4L, 2L, 3L), 
         .Label = c("1418126_at", "1423477_at", "1434193_at", "145119_a_at"), 
         class = "factor"), Gene.Id = structure(c(3L, 2L, NA, 1L), 
         .Label = c("100506144///9204", "2192", "6352"), class = "factor"), 
         Score.d = c(28.52578, 24.87866, 24.43532, 6.22395)), 
         .Names = c("Probe.Id", "Gene.Id", "Score.d"), 
         class = "data.frame", row.names = c(NA, -4L))

# 1) get the positions of "///"
idx <- grepl("[/]{3}", df$Gene.Id)

# 2) create 3 data.frames
df1 <- df[!idx, ] # don't touch this.
df2 <- df[idx, ] # we need to work on this

# 3) duplicate
df3 <- df2 # duplicate it.

4) sub    
df2$Gene.Id <- sub("[/]{3}.*$", "", df2$Gene.Id) # replace the end
df3$Gene.Id <- sub("^.*[/]{3}", "", df3$Gene.Id) # replace the beginning

# 5) combine/put them back
df.out <- rbind(df1, df2, df3)

# if necessary sort them here.
like image 167
Arun Avatar answered Nov 13 '22 05:11

Arun


here a solution using strsplit and merge

dat <- read.table(text ='Probe.Id     Gene.Id             Score.d
1418126_at   6352                28.52578
145119_a_at  2192                24.87866
1423477_at   NA                  24.43532
1434193_at   100506144///9204    6.22395',header=T,stringsAsFactors=F)
dat1 <- dat
xx <- do.call(rbind,strsplit(dat$Gene.Id,split='///'))
dat[which(xx[,1]!=xx[,2]),2]  <- xx[which(xx[,1]!=xx[,2]),1]
dat1[which(xx[,1]!=xx[,2]),2]  <- xx[which(xx[,1]!=xx[,2]),2]
  merge(dat,dat1,all.y=T,all.x=T)
     Probe.Id   Gene.Id  Score.d
1  1418126_at      6352 28.52578
2  1423477_at      <NA> 24.43532
3  1434193_at 100506144  6.22395
4  1434193_at      9204  6.22395
5 145119_a_at      2192 24.87866
like image 39
agstudy Avatar answered Nov 13 '22 03:11

agstudy


Here's a method that uses the constructor for data.frame, using the "feature" that it silently recycles input vectors:

do.call(rbind, 
        apply(dat, 1, function(x) 
                         data.frame(Probe.ID=x['Probe.Id'], 
                                    Gene.Id=strsplit(x['Gene.Id'], '///'),
                                    Score.d=x['Score.d'],
                                    row.names=NULL
                                   )
             )
        )
##      Probe.ID   Gene.Id  Score.d
## 1  1418126_at      6352 28.52578
## 2 145119_a_at      2192 24.87866
## 3  1423477_at      <NA> 24.43532
## 4  1434193_at 100506144  6.22395
## 5  1434193_at      9204  6.22395
like image 27
Matthew Lundberg Avatar answered Nov 13 '22 04:11

Matthew Lundberg