
Using fread() to select rows and columns, the way read.csv.sql() does

Tags: r, data.table

I know fread is relatively new, but it gives great performance improvements. What I want to know is: can you select both rows and columns from the file while reading it, a bit like read.csv.sql does? I know the select argument of fread picks the columns to read, but what about reading only the rows that satisfy certain criteria?

For example, can something like below be implemented using fread?

read.csv.sql(file, sql = "select V2, V4, V7, V8, V9, V10 from file where V5 == 'CE' and V10 >= 500", header = FALSE, sep = "|", eol = "\n")

If this is not possible yet, is it advisable to read all the data and then use subset() or similar to arrive at the final result? Or would that defeat the purpose of using fread?

For reference, I have to read about 800 files, each containing about 100,000 rows and 10 columns. Any input is welcome.

Thanks.

Shivam asked May 06 '14 19:05




1 Answer

fread() cannot yet select rows the way read.csv.sql() does. It is still better to read the entire file (memory permitting) and then subset it by your criteria. For a 200 MB file, fread() + subset() was about 4 times faster than read.csv.sql().

So, using @Arun's suggestion,

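library(data.table)
# files: character vector of file paths; fn records each row's source file
ans = rbindlist(lapply(files, function(x) fread(x)[, fn := x]))
subset(ans, V5 == "CE" & V10 >= 500)  # the criteria from the question; adjust as needed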

is better than the approach in the original question.
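As a variation on the same idea, each file can be filtered right after it is read, so the combined table only ever holds the matching rows. The following is a minimal sketch, not a benchmarked recommendation: the two small sample files and the helper name read_filtered are made up for illustration, and the column positions and criteria are taken from the query in the question.

```r
library(data.table)

# Hypothetical sample data: two small pipe-delimited files without headers,
# standing in for the real set of files.
demo_dir <- tempfile("fread_demo_")
dir.create(demo_dir)
for (i in 1:2) {
  sample_dt <- data.table(
    V1 = i, V2 = 1:6, V3 = "a", V4 = "b",
    V5 = rep(c("CE", "PE"), 3),
    V6 = 0, V7 = 1, V8 = 2, V9 = 3,
    V10 = c(100, 200, 500, 600, 700, 800)
  )
  fwrite(sample_dt, file.path(demo_dir, sprintf("part%d.txt", i)),
         sep = "|", col.names = FALSE)
}
files <- list.files(demo_dir, full.names = TRUE)

# Hypothetical helper: read only the columns the query needs (V5 is read
# because the filter uses it), filter the rows, then tag the source file.
read_filtered <- function(path) {
  dt <- fread(path, sep = "|", header = FALSE,
              select = c(2, 4, 5, 7, 8, 9, 10))
  out <- dt[V5 == "CE" & V10 >= 500, .(V2, V4, V7, V8, V9, V10)]
  out[, fn := basename(path)]
  out
}

ans <- rbindlist(lapply(files, read_filtered))
print(ans)
```

Whether this beats reading everything and subsetting once will depend on the data; it mainly helps when the filter discards most rows and memory is tight.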

Shivam answered Oct 14 '22 23:10