Using fread() to select rows and columns, the way read.csv.sql() does

Tags:

r

data.table

I know fread is relatively new, but it gives great performance improvements. What I want to know is whether you can select both rows and columns from the file you are reading, a bit like read.csv.sql does. I know the select argument of fread lets you choose which columns to read, but what about reading only the rows that satisfy certain criteria?

For example, can something like below be implemented using fread?

read.csv.sql(file, sql = "select V2,V4,V7,V8,V9, V10 from file where V5=='CE' and V10 >= 500",header = FALSE, sep= '|', eol ="\n")

If this is not possible yet, is it advisable to read all of the data and then use subset (or similar) to arrive at the final result? Or would that defeat the purpose of using fread?

For reference, I have to read about 800 files, each containing about 100,000 rows and 10 columns. Any input is welcome.

Thanks.

asked May 06 '14 19:05

Shivam

1 Answer

fread() cannot yet filter rows while reading, the way read.csv.sql() does. It is still better to read the entire file (memory permitting) and then subset it according to your criteria. For a 200 MB file, fread() + subset() was about 4 times faster than read.csv.sql().

So, using @Arun's suggestion,

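# Read every file, tag each row with its source file name (fn), and stack the results
ans = rbindlist(lapply(files, function(x) fread(x)[, fn := x]))
# Replace the condition with your own criteria; this one is the filter from the question
subset(ans, V5 == "CE" & V10 >= 500)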

is better than the approach in the original question.
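
As a rough sketch of how the column selection and row filter from the question could be combined per file, something like the following may work. The sample files, the temporary directory, and the helper name read_filtered are made up for illustration; only fread(), fwrite(), and rbindlist() come from data.table, and the column layout (10 pipe-delimited columns, no header) is assumed from the question.

```r
library(data.table)

# Hypothetical sample data: two small pipe-delimited files without a header,
# standing in for the ~800 files described in the question.
tmp_dir <- tempfile("fread_demo_")
dir.create(tmp_dir)
for (i in 1:2) {
  sample_dt <- data.table(
    V1 = 1:6, V2 = paste0("id", 1:6), V3 = 0, V4 = paste0("grp", 1:6),
    V5 = rep(c("CE", "PE"), 3), V6 = 0, V7 = 7, V8 = 8, V9 = 9,
    V10 = c(100, 200, 500, 600, 700, 800)
  )
  fwrite(sample_dt, file.path(tmp_dir, sprintf("part%d.txt", i)),
         sep = "|", col.names = FALSE)
}
files <- list.files(tmp_dir, full.names = TRUE)

# Hypothetical helper: read one file, keep matching rows, tag the source file.
read_filtered <- function(path) {
  # select= limits parsing to the columns needed for the filter and the result
  dt <- fread(path, sep = "|", header = FALSE,
              select = c(2, 4, 5, 7, 8, 9, 10))
  # The row filter is applied in memory, after the file has been read
  dt <- dt[V5 == "CE" & V10 >= 500]
  dt[, V5 := NULL]            # V5 was only needed for the filter
  dt[, fn := basename(path)]  # record which file each row came from
  dt
}

ans <- rbindlist(lapply(files, read_filtered))
print(ans)
```

Filtering inside the helper, before rbindlist(), keeps only the matching rows of each file in memory, which may matter when stacking several hundred files. In the sample data above, each file contributes two matching rows.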

answered Oct 14 '22 23:10

Shivam
