I am relatively new to large-data processing in R and am looking for advice on how to deal with a 50 GB CSV file. The current problem is the following:
The table looks like this:
ID,Address,City,States,... (50 more fields of characteristics of a house)
1,1,1st street,Chicago,IL,...
# the first 1 comes from write.csv, which wrote the row names as an index column in the file
I would like to find all rows that belong to San Francisco, CA. It is supposed to be an easy problem, but the CSV is too large.
I know of two ways of doing it in R, and another way that uses a database:
(1) Using R's ff package (ffdf):
The file was last saved with write.csv, and it contains columns of several different types.
all <- read.csv.ffdf(
file="<path of large file>",
sep = ",",
header=TRUE,
VERBOSE=TRUE,
first.rows=10000,
next.rows=50000
)
the console gives me this:
Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,
: vmode 'character' not implemented
Searching online, I found several answers that did not fit my case, and I can't really make sense of how to convert "character" columns into the "factor" type as they suggested.
Then I tried read.table.ffdf, which was even more of a disaster. I can't find a solid guide for that one.
(2) Using R's readLines:
I know this is another good way, but I can't find an efficient way to do it.
(3) Using SQL:
I am not sure how to load the file into a SQL database or how to handle it from there; if there is a good guide, I would like to try it. But in general, I would like to stick with R.
Thanks for any replies and help!
So, how do you open large CSV files in Excel? Essentially, there are two options: Split the CSV file into multiple smaller files that do fit within the 1,048,576 row limit; or, Find an Excel add-in that supports CSV files with a higher number of rows.
Loading a large dataset: use fread() or functions from readr instead of read.xxx(). If you really need to read an entire CSV into memory, by default, R users use the read.table method or variations thereof (such as read.
Excel has a limit of 32,767 characters per cell, and of 1,048,576 rows and 16,384 columns per sheet. CSV files themselves can hold many more rows.
The fread function in the data.table package, for example, can read large flat files much more quickly than comparable base R functions.
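As a rough illustration of the fread route applied to the question's table, the sketch below reads only a few columns and then filters them. The tiny temporary file stands in for the real 50 GB file, and the column names ID, Address, City and States are taken from the question's header line. Whether the selected columns of the real file fit in memory is an assumption that would need checking.

```r
library(data.table)

# Stand-in for the large file: a tiny CSV with the question's first columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Address,City,States",
  "1,1st street,Chicago,IL",
  "2,2nd street,San Francisco,CA",
  "3,3rd street,San Francisco,CA"
), csv_path)

# Read only the columns needed for the filter; `select` skips the rest.
houses <- fread(csv_path, select = c("ID", "City", "States"))

# Keep the rows for San Francisco, CA.
sf_rows <- houses[City == "San Francisco" & States == "CA"]
```

Reading fewer columns reduces the memory needed, but the number of rows stays the same, so this only helps if the reduced table fits in RAM.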
The contents of a CSV file can be read as a data frame in R using the read.csv(…) function. The CSV file to be read should either be in the current working directory, or the working directory should be set accordingly using the setwd(…) command in R.
You can use data.table to read and manipulate large files more efficiently. If needed, you can leverage on-disk storage with ff. You might also want to consider some on-disk processing rather than holding that entire object in R's memory.
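For the ff route, the "vmode 'character' not implemented" error from the question arises because ff cannot store character columns. A common workaround is to declare text columns as factors through colClasses. The sketch below guesses the column types from a small sample and maps character to factor. The temporary file and its four columns are stand-ins for the real data, and it is an assumption that the types seen in the sample hold for the whole file.

```r
library(ff)

# Stand-in for the large file: a tiny CSV with the question's first columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Address,City,States",
  "1,1st street,Chicago,IL",
  "2,2nd street,San Francisco,CA",
  "3,3rd street,San Francisco,CA"
), csv_path)

# Guess column types from a small sample, then map character -> factor,
# because ff has no storage mode for character vectors.
sample_rows <- read.csv(csv_path, nrows = 1000, stringsAsFactors = FALSE)
classes <- vapply(sample_rows, function(col) class(col)[1], character(1))
classes[classes == "character"] <- "factor"

houses_ff <- read.csv.ffdf(
  file = csv_path,
  header = TRUE,
  colClasses = unname(classes),
  first.rows = 10000,
  next.rows = 50000
)

# Pull only the two filter columns into RAM (as factors), find matching
# row positions, then extract those rows as an ordinary data frame.
keep <- which(houses_ff$City[] == "San Francisco" & houses_ff$States[] == "CA")
sf_rows <- houses_ff[keep, ]
```

Columns with very many distinct strings, such as street addresses, make for large factor level sets, so this may still be slow on a file of that size.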
x = sample(c("foofoofoo", "barbarbar"), 10000000, replace = TRUE) gives a factor of 0.5x (R:csv). Based on the max, your 9GB file would potentially take 18GB of memory to store in R, if not more.
The header is not included in the row count, so this CSV has 7 rows and 4 columns. SQL-like filters can be applied to the CSV content, and the matching rows retrieved, using the subset(csv_data, condition) function in R. Several conditions can be applied at once by combining them with logical operators.
You can use R with SQLite behind the scenes via the sqldf package. You'd use the read.csv.sql function in the sqldf package, and then you can query the data however you want to obtain the smaller data frame.
The example from the docs:
library(sqldf)
iris2 <- read.csv.sql("iris.csv",
sql = "select * from file where Species = 'setosa' ")
I've used this library on VERY large CSV files with good results.
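Adapted to the question's table, the same call might look like the sketch below. The temporary file stands in for the real 50 GB file, and the column names City and States come from the question's header line. One caveat to check: write.csv quotes strings by default, and as far as I know read.csv.sql does not strip those quotes on its own, so the real file may need them removed first (for example through the filter argument).

```r
library(sqldf)

# Stand-in for the large file: a tiny, unquoted CSV with the question's
# first columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Address,City,States",
  "1,1st street,Chicago,IL",
  "2,2nd street,San Francisco,CA",
  "3,3rd street,San Francisco,CA"
), csv_path)

# The file is loaded into a temporary SQLite database, and only the rows
# matching the query come back into R as a data frame.
sf_rows <- read.csv.sql(
  csv_path,
  sql = "select * from file where City = 'San Francisco' and States = 'CA'"
)
```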
R -- in its basic configuration -- loads data into memory. Memory is cheap. 50 GB still is not a typical configuration (and you would need more than that to load the data in and store it). If you are really good in R, you might be able to figure out another mechanism. If you have access to a cluster, you could use some parallel version of R or Spark.
You could also load the data into a database. For the task at hand, a database is very well suited to the problem. R easily connects to almost any database. And, you might find a database very useful for what you want to do.
Or, you could just process the text file in situ. Command line tools such as awk, grep, and perl are very suitable for this task. I would recommend this approach for a one-time effort. I would recommend a database if you want to keep the data around for analytic purposes.
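If you would rather stay in R than use the command-line tools, the same in-place idea can be sketched in base R by reading the file a block of lines at a time and keeping only the matching rows. The helper name filter_csv_chunks and its parameters are hypothetical, the column names City and States come from the question, and the sketch assumes no quoted field contains a line break.

```r
# Hypothetical helper (not from the original posts): scan a CSV in chunks of
# lines and keep only the rows for one city and state.
filter_csv_chunks <- function(path, city, state, chunk_lines = 50000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)   # reused so every chunk parses with names
  result <- NULL
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break   # end of file
    # Read everything as character so column types agree across chunks.
    chunk <- read.csv(text = c(header, lines), colClasses = "character")
    hits <- chunk[which(chunk$City == city & chunk$States == state), ,
                  drop = FALSE]
    result <- rbind(result, hits)    # only matching rows accumulate in memory
  }
  result
}

# Usage on a tiny stand-in file; chunk_lines = 2 forces several chunks.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Address,City,States",
  "1,1st street,Chicago,IL",
  "2,2nd street,San Francisco,CA",
  "3,3rd street,San Francisco,CA"
), csv_path)
sf_rows <- filter_csv_chunks(csv_path, "San Francisco", "CA", chunk_lines = 2L)
```

Only one chunk plus the accumulated matches are held in memory at a time, so the chunk size trades memory for the number of parsing passes.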