
Sample A CSV File Too Large To Load Into R?

Tags:

r

I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.

Is this possible? I cannot seem to find an answer anywhere.

Asked by Anton, Nov 24 '13

People also ask

How do I open a big CSV file in R?

To load a large dataset, use fread() from data.table or the functions from readr instead of the base read.xxx() functions. If you really need to read an entire CSV into memory, R users by default use read.table() or variations thereof (such as read.csv()).

How much data can you load into R?

R objects live entirely in memory. Even on 64-bit systems you cannot index objects with huge numbers of rows and columns (vectors are limited to roughly 2 billion elements), and in practice you hit limits at around 2-4 GB of data.

Is there a size limit on CSV files?

The CSV format itself imposes no size limit. Excel, however, limits cells to 32,767 characters and sheets to 1,048,576 rows and 16,384 columns; a CSV file can hold many more rows than Excel can open.


1 Answer

If you don't want to pay thousands of dollars for Revolution R so that you can load and analyze your data in one go, then sooner or later you need to figure out a way to sample your data.

That step is easier to do outside R.

(1) Linux Shell:

Assuming your data is in a consistent format with one record per row, you can do:

sort -R data | head -n 1000 >data.sample

This randomly shuffles all the rows and writes the first 1000 into a separate file, data.sample.
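A minimal end-to-end sketch of the same idea, assuming GNU coreutils and hypothetical file names (data.csv, data.sample). It uses shuf -n instead of sort -R, and keeps the CSV header line out of the shuffle so the sample is still a valid CSV:

```shell
# Create a small toy CSV to demonstrate (10,000 data rows plus a header).
printf 'id,value\n' > data.csv
seq 1 10000 | awk '{print $1","$1*2}' >> data.csv

# Keep the header, then randomly sample 1000 data rows with shuf.
# Note: `sort -R` hashes each line, so duplicate lines end up adjacent;
# `shuf` gives a true uniform shuffle and can stop after -n lines.
head -n 1 data.csv > data.sample
tail -n +2 data.csv | shuf -n 1000 >> data.sample

wc -l < data.sample   # 1001 lines: header + 1000 sampled rows
```

shuf -n also avoids sorting the entire file just to take the first 1000 lines, which matters on a 3GB input.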

(2) If the data is too large to fit into memory:

Another solution is to store the data in a database. For example, I have many tables stored in a MySQL database in a clean tabular format, and I can draw a sample with:

select * from tablename order by rand() limit 1000

You can easily communicate between MySQL and R using RMySQL, and you can index your columns to guarantee query speed. You can also use the power of the database to verify, say, the mean or standard deviation of the whole dataset against your sample.
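The same idea can be sketched without a running MySQL server by swapping in SQLite (a deliberate substitution: the sqlite3 CLI is file-based and needs no setup). File names here are hypothetical; the SELECT ... ORDER BY RANDOM() LIMIT 1000 line plays the role of MySQL's ORDER BY RAND():

```shell
# Toy CSV with 10,000 rows and no header (hypothetical file names).
seq 1 10000 | awk '{print $1","$1*2}' > data.csv

# Load the CSV into a SQLite table, then sample 1000 random rows
# into data.sample. `.import` bulk-loads the file; the SELECT with
# ORDER BY RANDOM() mirrors MySQL's ORDER BY RAND().
sqlite3 sample.db <<'EOF'
CREATE TABLE t (id INTEGER, value INTEGER);
.mode csv
.import data.csv t
.output data.sample
SELECT * FROM t ORDER BY RANDOM() LIMIT 1000;
EOF

wc -l < data.sample   # 1000 sampled rows
```

Once the data lives in a database like this, repeated samples are cheap, and you can compute summary statistics over the full table without ever loading it into R.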

In my experience, these are the two most common ways of dealing with 'big' data.

Answered by B.Mr.W., Nov 15 '22