
How to read a subset of large dataset in R?

Tags:

r

read.table

I have a dataset with about 2 million rows, and I want to read only a subset of it rather than the whole thing. The dataset contains a date column, so I just want to read the rows that fall within a date range; reading the entire dataset would be time-consuming and a waste of memory. How can I accomplish this? Can anyone guide me?

Zeeshan shaikh asked Sep 19 '14 11:09


1 Answer

Use the skip= parameter of read.table, together with nrows=:

read.table("file.txt",skip= ,nrows= )

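Both skip= and nrows= take whole numbers, so just put them after the =. skip= is the number of lines to skip at the start of the file before reading begins.

nrows= is the maximum number of rows to read once reading has started, so it sets how far into the file the import goes.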
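As a minimal sketch of how the two parameters work together, the example below writes a small made-up file to a temporary path and then reads only a slice of it. The file contents, the column names date and value, and the numbers passed to skip= and nrows= are assumptions chosen for illustration, not part of the original question.

```r
# Build a small example file: one header line plus 10 daily rows (made-up data).
path <- tempfile(fileext = ".txt")
dates <- seq(as.Date("2005-12-28"), by = "day", length.out = 10)
df <- data.frame(date = format(dates), value = 1:10)
write.table(df, path, row.names = FALSE, quote = FALSE)

# Skip the header line plus the first 3 data rows, then read 4 rows.
# The header is among the skipped lines, so column names are given by hand.
subset_df <- read.table(path, skip = 4, nrows = 4,
                        col.names = c("date", "value"),
                        stringsAsFactors = FALSE)
print(subset_df)
```

Note that skipping past the header means read.table no longer sees the column names, which is why col.names= is supplied here.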

I suggest reading https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html if you haven't done so already.

Also, please see one of my questions:

R - Reading lines from a .txt-file after a specific line

It touches on much the same subject.

Another possible way is to use grep() to compute the value for skip=:

read.table(...,skip=grep("2005-12-31", readLines("File.txt")),nrows=365)

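This line skips everything up to and including the line that matches the pattern given to grep(), then reads the lines after it. nrows= stops the import once 365 lines have been read, which gives you one year of dates provided each line holds exactly one date. Two caveats: grep() returns the line number of every match, so the date string should occur only once in the file, and readLines() still scans the whole file as text before read.table parses anything.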
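The off-by-one behaviour of this approach can be checked on a small file. The sketch below is an illustration under assumptions: the temporary file, the column names date and value, and the helper variable start_line are all made up for the example. It uses fixed = TRUE so the date is matched as a literal string, and takes only the first match.

```r
# Made-up example file: one header line plus 10 daily rows.
path <- tempfile(fileext = ".txt")
dates <- seq(as.Date("2005-12-28"), by = "day", length.out = 10)
write.table(data.frame(date = format(dates), value = 1:10),
            path, row.names = FALSE, quote = FALSE)

# Line number of the first line containing the start date.
# fixed = TRUE treats the pattern as a literal string, not a regex.
start_line <- grep("2005-12-31", readLines(path), fixed = TRUE)[1]

# skip = start_line begins reading AFTER the matching line.
after_match <- read.table(path, skip = start_line, nrows = 3,
                          col.names = c("date", "value"),
                          stringsAsFactors = FALSE)

# skip = start_line - 1 keeps the matching line as the first row.
from_match <- read.table(path, skip = start_line - 1, nrows = 3,
                         col.names = c("date", "value"),
                         stringsAsFactors = FALSE)

print(after_match$date[1])
print(from_match$date[1])
```

If the range should include the start date itself, subtracting 1 from the grep() result is the simplest adjustment.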

This seems kinda complicated, but it's the only way I know how to solve this.

Olli J answered Oct 04 '22 06:10