The CSV file to be processed does not fit into memory. How can one read ~20K random lines of it to do basic statistics on the resulting data frame?
If the CSV file is extremely large, a fast way to import it into R is the fread() function from the data.table package. In this case the data is returned as a data.table rather than a plain data frame.
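A minimal sketch, assuming the file (here called big.csv, a placeholder name) can be loaded at least once:

library(data.table)

# fread() parses large files quickly and returns a data.table
dt <- fread("big.csv")

# keep ~20K random rows for the basic statistics
sample_dt <- dt[sample(.N, 20000)]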
In RStudio, click on the Workspace tab, then on "Import Dataset" -> "From text file". A file browser will open up; locate the .csv file and click Open. You'll see a dialog that gives you a few options on the import.
To load a .csv file into the current session and work with it, use the read.csv() function in base R. The output is a data frame, with rows numbered by integers starting at 1.
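A minimal base-R sketch, assuming a file named x.csv with a header row:

# read.csv() returns a data frame with rows numbered 1, 2, 3, ...
df <- read.csv("x.csv")
head(df)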
You can also just do it in the terminal with perl.
perl -ne 'print if (rand() < .01)' biglist.txt > subset.txt
This won't necessarily get you exactly 20,000 lines. With rand() < .01 it keeps roughly 1% of the total lines, so to target ~20,000 lines set the threshold to 20000 divided by the file's total line count. It will, however, be really fast, and you'll have both the original file and the subset in your directory. You can then load the smaller file into R however you want, as in the sketch below.
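One caveat when reading the subset back in: the header line of biglist.txt survives the random filter only by chance, so column names may need to be supplied by hand. A sketch, where col1 and col2 are hypothetical column names:

# header = FALSE because the original header line was
# probably dropped by the random filter
subset_df <- read.csv("subset.txt", header = FALSE)
names(subset_df) <- c("col1", "col2")   # hypothetical column names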
Try this, based on examples 6e and 6f on the sqldf GitHub home page:
library(sqldf)
DF <- read.csv.sql("x.csv", sql = "select * from file order by random() limit 20000")
See ?read.csv.sql and pass other arguments as needed based on the particulars of your file.
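For instance, if the file were semicolon-delimited with no header row (a hypothetical layout), the same query could be written as:

DF <- read.csv.sql("x.csv",
                   sql = "select * from file order by random() limit 20000",
                   header = FALSE, sep = ";")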