Reading around, I found that the best way to read a larger-than-memory CSV file is to use read.csv.sql from the sqldf package. This function reads the data directly into an SQLite database and then executes a SQL statement on it.
I noticed the following: the data read into SQLite seems to be stored in a temporary table, so to make it available for later use, the SQL statement has to explicitly create a persistent table from it.
As an example, the following code reads some sample data into SQLite:
# generate sample data
sample_data <- data.frame(col1 = sample(letters, 100000, TRUE), col2 = rnorm(100000))
# save as csv
write.csv(sample_data, "sample_data.csv", row.names = FALSE)
# create a sample sqlite database
library(sqldf)
sqldf("attach sample_db as new")
# read the csv into the database and create a table with its content
read.csv.sql("sample_data.csv", sql = "create table data as select * from file",
dbname = "sample_db", header = T, row.names = F, sep = ",")
The data can then be accessed with sqldf("select * from data limit 5", dbname = "sample_db").
The problem is the following: the SQLite file takes up twice as much space as it should. My guess is that it contains the data twice: once in the temporary table used for the read, and once in the stored table. The database can be cleaned up with sqldf("vacuum", dbname = "sample_db"), which reclaims the empty space, but it takes a long time, especially when the file is big.
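For what it's worth, the duplication is easy to confirm from R (a sketch using base R's file.size(); file names as in the example above):

# database is roughly twice the size of the data before vacuuming
file.size("sample_data.csv")
file.size("sample_db")
# vacuum and check again: the empty space is reclaimed
sqldf("vacuum", dbname = "sample_db")
file.size("sample_db")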
Is there a better solution that doesn't create this data duplication in the first place?
Solution: use RSQLite directly, without going through sqldf:
library(RSQLite)
con <- dbConnect(SQLite(), dbname = "sample_db")
# read the csv file straight into the database, creating the table directly
dbWriteTable(con, name = "sample_data", value = "sample_data.csv",
             row.names = FALSE, header = TRUE, sep = ",")
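To check that the import worked, the table can be queried through the same connection (dbGetQuery() and dbDisconnect() come from DBI, which RSQLite loads):

# query the newly created table, then close the connection
dbGetQuery(con, "select * from sample_data limit 5")
dbDisconnect(con)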