I have some very big delimited data files and I want to process only certain columns in R without taking the time and memory to create a data.frame for the whole file.
The only options I know of are read.table, which is very wasteful when I only want a couple of columns, or scan, which seems too low-level for what I want.
Is there a better option, either with pure R or perhaps by calling out to some other shell script to do the column extraction and then using scan or read.table on its output? (Which leads to the question: how do you call a shell script and capture its output in R?)
Method 1: Using the read.table() function. read.table() is a built-in function of the R programming language; to import only particular columns from the data, mark the unwanted columns as "NULL" in its colClasses argument so they are skipped during the read.
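A minimal sketch of that approach, assuming a five-column comma-separated file named "myFile.csv" (the file name and the column types here are illustrative, not from the original question):

# "NULL" entries in colClasses tell read.table to skip those columns
# entirely, so they are never parsed or stored in memory.
df <- read.table("myFile.csv", header = TRUE, sep = ",",
                 colClasses = c("numeric", "NULL", "character", "NULL", "NULL"))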
To pick out single or multiple columns, use the select() function from the dplyr package. select() expects a data frame as its first input ('argument', in R terms), followed by the names of the columns you want to extract, separated by commas. Note that this subsets a data frame that is already in memory, so on its own it does not avoid loading the whole file.
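For example, with a toy data frame standing in for an already-imported file (the column names here are hypothetical):

library(dplyr)
# Small example data frame; in practice this would come from read.table() etc.
df <- data.frame(id = 1:3, name = c("a", "b", "c"), score = c(10, 20, 30))
df_subset <- select(df, id, score)  # keep only the id and score columns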
Sometimes I do something like this when I have the data in a tab-delimited file:
df <- read.table(pipe("cut -f1,5,28 myFile.txt"))
That lets cut do the data selection, which it can do without using much memory at all.
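The same idea should work for comma-separated files on a Unix-like system: cut defaults to tab as its delimiter, so pass -d',' and tell read.table the separator (the file name here is illustrative):

# cut extracts fields 1, 5, and 28 from a comma-delimited file;
# read.table parses the already-reduced stream.
df <- read.table(pipe("cut -d',' -f1,5,28 myFile.csv"), sep = ",", header = TRUE)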
See Only read limited number of columns for a pure R version, using "NULL" in the colClasses argument to read.table (the approach sketched under Method 1 above).