I wrote a previous (similar) post here where I was trying to create a wide table as opposed to a long table. I have since realized that it's best to have my table in the long format, so I am posting this as a separate question, along with what I have tried.
I am using R to rbind ~11000 files using:
library(data.table)  # rbindlist, fread
library(parallel)    # mclapply

# get list of ~11000 files
lfiles <- list.files(pattern = "*.tsv", full.names = TRUE)

# row-bind the files:
# - fread to read each file, rbindlist to combine them
# - mclapply with 32 cores to read the files in parallel
# - the file basename is added as an id column to identify rows
dat <- rbindlist(mclapply(lfiles, function(X) {
  data.frame(id = basename(tools::file_path_sans_ext(X)),
             fread(X))
}, mc.cores = 32))
I am using R because my downstream processing (creating plots, etc.) is in R. I have two questions:
1. Is there a way to make my code more efficient/faster? I know the expected number of rows at the end, so would it help to preallocate the data frame?
2. How (in what format) should I save this huge dataset: as .RData, as a database, or something else?
As additional info: I have three types of files for which I want this done. They look like this:
[centos@ip data]$ head C021_0011_001786_tumor_RNASeq.abundance.tsv
target_id length eff_length est_counts tpm
ENST00000619216.1 68 26.6432 10.9074 5.69241
ENST00000473358.1 712 525.473 0 0
ENST00000469289.1 535 348.721 0 0
ENST00000607096.1 138 15.8599 0 0
ENST00000417324.1 1187 1000.44 0.0673096 0.000935515
ENST00000461467.1 590 403.565 3.22654 0.11117
ENST00000335137.3 918 731.448 0 0
ENST00000466430.5 2748 2561.44 162.535 0.882322
ENST00000495576.1 1319 1132.44 0 0
[centos@ip data]$ head C021_0011_001786_tumor_RNASeq.rsem.genes.norm_counts.hugo.tab
gene_id C021_0011_001786_tumor_RNASeq
TSPAN6 1979.7185
TNMD 1.321
DPM1 1878.8831
SCYL3 452.0372
C1orf112 203.6125
FGR 494.049
CFH 509.8964
FUCA2 1821.6096
GCLC 1557.4431
[centos@ip data]$ head CPBT_0009_1_tumor_RNASeq.rsem.genes.norm_counts.tab
gene_id CPBT_0009_1_tumor_RNASeq
ENSG00000000003.14 2005.0934
ENSG00000000005.5 5.0934
ENSG00000000419.12 1100.1698
ENSG00000000457.13 2376.9100
ENSG00000000460.16 1536.5025
ENSG00000000938.12 443.1239
ENSG00000000971.15 1186.5365
ENSG00000001036.13 1091.6808
ENSG00000001084.10 1602.7165
Any help would be much appreciated!
Thanks!
As many before me have documented, I also find that rbindlist() is the fastest method and rbind() is the slowest; bind_rows() is half as fast as rbindlist().
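For context, here is a minimal sketch of how such a comparison could be run; the microbenchmark package and the 100-file subset are assumptions for illustration, not part of the original measurement:

library(data.table)      # fread, rbindlist
library(dplyr)           # bind_rows
library(microbenchmark)

# read a small sample of files once, then time only the binding step
sub <- lapply(lfiles[1:100], fread)
microbenchmark(
  rbind     = do.call(rbind, sub),
  bind_rows = bind_rows(sub),
  rbindlist = rbindlist(sub),
  times = 10
)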
rbind() in R: the rbind() function combines several vectors, matrices, or data frames by rows. rbind() stands for row binding; in simpler terms, it joins multiple rows (for example, two data frames or vectors) into a single object.
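For example (df1 and df2 are made-up toy data frames):

df1 <- data.frame(id = 1, value = 10)
df2 <- data.frame(id = 2, value = 20)
rbind(df1, df2)   # one data frame with both rows stacked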
You can't do this faster than using fread and rbindlist in R. But you should not use data.frame and copy the data; instead, assign by reference:

DF <- fread(X)
DF[, id := basename(tools::file_path_sans_ext(X))]
return(DF)
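Put together, a sketch of the original call with this change (keeping the 32 cores and the lfiles vector from the question) could look like:

library(data.table)
library(parallel)

dat <- rbindlist(mclapply(lfiles, function(X) {
  DF <- fread(X)                                       # read one file as a data.table
  DF[, id := basename(tools::file_path_sans_ext(X))]   # add id by reference, no copy
  DF
}, mc.cores = 32))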
However, you should consider using a database.
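One possible route is SQLite via DBI/RSQLite; a minimal sketch follows, where the file name rnaseq.sqlite and the table name abundance are placeholders, and saveRDS() would be the simpler single-file alternative to .RData:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), "rnaseq.sqlite")   # on-disk database file
dbWriteTable(con, "abundance", dat)                    # write the combined table
# dat <- dbReadTable(con, "abundance")                 # read it back later
dbDisconnect(con)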
PS: The correct regex for the pattern argument is ".+\\.tsv$". This matches any file name with one or more characters, followed by a dot and the string "tsv" at the end of the file name.
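Applied to the question's code, the file listing would then be:

lfiles <- list.files(pattern = ".+\\.tsv$", full.names = TRUE)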