I wasn't sure how to word this correctly while searching, so sorry if this has a simple answer.
I have 58 dataframes with ~25,000 rows each that I am getting from .csv's. They look something like this:
Probe.Id Gene.Id Score.d
1418126_at 6352 28.52578
145119_a_at 2192 24.87866
1423477_at NA 24.43532
1434193_at 100506144///9204 6.22395
Ideally I want to split the Ids at the "///"s and get them on new rows. Like so:
Probe.Id Gene.Id Score.d
1418126_at 6352 28.52578
145119_a_at 2192 24.87866
1423477_at NA 24.43532
1434193_at 100506144 6.22395
1434193_at 9204 6.22395
Using strsplit allows me to get Gene.Id as a list of character vectors, but once I have that I'm not sure what the most effective way is to get each of the individual ids on their own row with the correct values from the other columns. Ideally I don't want to just be looping through 25,000 rows.
If anyone knows the right way to do this I'd super appreciate it.
EDIT: I should've added that there's a complicating factor in that there are rows which have ids like so:
333932///126961///653604///8350///8354///8355///8356///8968///8352///8358///8351///8353///8357"
and I have no idea what the maximum number of ids in a row is.
Edit: New solution after OP's comment. Very straightforward using data.table
:
df <- structure(list(Probe.Id = c("1418126_at", "145119_a_at", "1423477_at",
"1434193_at", "100_at"), Gene.Id = c("6352", "2192", NA,
"100506144///9204", "100506144///100506146///100506148///100506150"),
Score.d = c(28.52578, 24.87866, 24.43532, 6.22395, 6.22395)),
.Names = c("Probe.Id", "Gene.Id", "Score.d"), row.names = c(NA, 5L),
class = "data.frame")
require(data.table)
dt <- data.table(df)
dt.out <- dt[, list(Probe.Id = Probe.Id,
Gene.Id = unlist(strsplit(Gene.Id, "///")),
Score.d = Score.d), by=1:nrow(dt)]
> dt.out
# nrow Probe.Id Gene.Id Score.d
# 1: 1 1418126_at 6352 28.52578
# 2: 2 145119_a_at 2192 24.87866
# 3: 3 1423477_at NA 24.43532
# 4: 4 1434193_at 100506144 6.22395
# 5: 4 1434193_at 9204 6.22395
# 6: 5 100_at 100506144 6.22395
# 7: 5 100_at 100506146 6.22395
# 8: 5 100_at 100506148 6.22395
# 9: 5 100_at 100506150 6.22395
You could add fixed = TRUE
to the strsplit
expression to speedup further, if ///
is a fixed pattern.
Alternative Again using data.table
. Taking into consideration that strsplit
is a vectorised operation and that running it on the whole of Gene.Id
column would be much faster than running it thro' 1 row at a time (even though data.table
runs thro' very fast, you could get more speedup by splitting the previous code into 2 steps:
# first split using strsplit (data.table can hold list in its columns!!)
dt[, Gene.Id_split := strsplit(dt$Gene.Id, "///", fixed=TRUE)]
# then just unlist them
dt.2 <- dt[, list(Probe.Id = Probe.Id,
Gene.Id = unlist(Gene.Id_split),
Score.d = Score.d), by = 1:nrow(dt)]
I just replicated the data.table
shown in this example many times until I got 295245
rows. And then I ran a benchmark using rbenchmark
:
# first function
DT1 <- function() {
dt.1 <- dt[, list(Probe.Id = Probe.Id,
Gene.Id = unlist(strsplit(Gene.Id, "///", fixed = TRUE)),
Score.d = Score.d), by=1:nrow(dt)]
}
# expected to be faster function
DT2 <- function() {
dt[, Gene.Id_split := strsplit(dt$Gene.Id, "///", fixed=TRUE)]
# then just unlist them
dt.2 <- dt[, list(Probe.Id = Probe.Id, Gene.Id = unlist(Gene.Id_split), Score.d = Score.d), by = 1:nrow(dt)]
}
require(rbenchmark)
benchmark(DT1(), DT2(), replications=10, order="elapsed")
# test replications elapsed relative user.self sys.self
# 2 DT2() 10 15.708 1.000 14.390 0.391
# 1 DT1() 10 24.957 1.589 23.723 0.436
For this example, you get about 1.6 times faster. But this depends on the number of entries with ///
. Hope this helps.
OLD solution: (for continuity)
One way is to: 1) find the positions
where this ///
occurs, 2) extract
, 3) duplicate
, 4) sub
and 5) combine
them.
df <- structure(list(Probe.Id = structure(c(1L, 4L, 2L, 3L),
.Label = c("1418126_at", "1423477_at", "1434193_at", "145119_a_at"),
class = "factor"), Gene.Id = structure(c(3L, 2L, NA, 1L),
.Label = c("100506144///9204", "2192", "6352"), class = "factor"),
Score.d = c(28.52578, 24.87866, 24.43532, 6.22395)),
.Names = c("Probe.Id", "Gene.Id", "Score.d"),
class = "data.frame", row.names = c(NA, -4L))
# 1) get the positions of "///"
idx <- grepl("[/]{3}", df$Gene.Id)
# 2) create 3 data.frames
df1 <- df[!idx, ] # don't touch this.
df2 <- df[idx, ] # we need to work on this
# 3) duplicate
df3 <- df2 # duplicate it.
4) sub
df2$Gene.Id <- sub("[/]{3}.*$", "", df2$Gene.Id) # replace the end
df3$Gene.Id <- sub("^.*[/]{3}", "", df3$Gene.Id) # replace the beginning
# 5) combine/put them back
df.out <- rbind(df1, df2, df3)
# if necessary sort them here.
here a solution using strsplit
and merge
dat <- read.table(text ='Probe.Id Gene.Id Score.d
1418126_at 6352 28.52578
145119_a_at 2192 24.87866
1423477_at NA 24.43532
1434193_at 100506144///9204 6.22395',header=T,stringsAsFactors=F)
dat1 <- dat
xx <- do.call(rbind,strsplit(dat$Gene.Id,split='///'))
dat[which(xx[,1]!=xx[,2]),2] <- xx[which(xx[,1]!=xx[,2]),1]
dat1[which(xx[,1]!=xx[,2]),2] <- xx[which(xx[,1]!=xx[,2]),2]
merge(dat,dat1,all.y=T,all.x=T)
Probe.Id Gene.Id Score.d
1 1418126_at 6352 28.52578
2 1423477_at <NA> 24.43532
3 1434193_at 100506144 6.22395
4 1434193_at 9204 6.22395
5 145119_a_at 2192 24.87866
Here's a method that uses the constructor for data.frame
, using the "feature" that it silently recycles input vectors:
do.call(rbind,
apply(dat, 1, function(x)
data.frame(Probe.ID=x['Probe.Id'],
Gene.Id=strsplit(x['Gene.Id'], '///'),
Score.d=x['Score.d'],
row.names=NULL
)
)
)
## Probe.ID Gene.Id Score.d
## 1 1418126_at 6352 28.52578
## 2 145119_a_at 2192 24.87866
## 3 1423477_at <NA> 24.43532
## 4 1434193_at 100506144 6.22395
## 5 1434193_at 9204 6.22395
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With