
Load a small random sample from a large csv file into R data frame

The CSV file to be processed does not fit into memory. How can I read about 20,000 random lines from it into a data frame, so that I can compute basic statistics on the sample?

P.Escondido asked Mar 07 '14 21:03



2 Answers

You can also just do it in the terminal with perl.

perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt

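This won't necessarily give you exactly 20,000 lines: each line is kept independently with probability .01, so you get roughly 1% of the total. Adjust the threshold to your file size to land near 20,000. The header line is sampled like any other line, so it will usually be dropped. The command is very fast, though, since it makes a single pass over the file, and you end up with both the original and the subset in your directory. You can then load the smaller file into R however you want.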
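If an exact sample size and an intact header matter, a variation along these lines might work. It is only a sketch: `sample_csv` is a hypothetical helper name, `shuf` is assumed to be available (it ships with GNU coreutils), and the file is assumed to have exactly one header line and no quoted fields that span several lines.

```shell
# sample_csv IN OUT N  (hypothetical helper, not part of any existing tool)
# Assumes IN has exactly one header line and that every record sits on one line.
sample_csv() {
  in=$1
  out=$2
  n=$3
  # Copy the header line unchanged.
  head -n 1 "$in" > "$out"
  # Append n data lines picked at random, without replacement.
  tail -n +2 "$in" | shuf -n "$n" >> "$out"
}
```

Calling `sample_csv biglist.txt subset.txt 20000` would then leave a header plus 20,000 data lines in subset.txt (fewer if the file is shorter), which can be loaded into R in the usual way.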

Jed answered Sep 18 '22 13:09


Try this, based on examples 6e and 6f on the sqldf GitHub home page:

library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")

See ?read.csv.sql for other arguments you may need, depending on the particulars of your file.

G. Grothendieck answered Sep 21 '22 13:09