I have a data.table that is not very big (2 GB), but for some reason write.csv takes an extremely long time to write it out (I've never actually finished waiting) and seems to use a ton of RAM to do it. I tried converting the data.table to a data.frame, although this shouldn't really do anything since data.table extends data.frame. Has anyone run into this?
More importantly, if you stop it with Ctrl-C, R does not seem to give the memory back.
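For reference, a minimal sketch of the kind of call that hangs for me (the data.table here is a synthetic placeholder, not my actual data):
# Placeholder data.table of roughly comparable scale
# (two columns, 1e8 rows, on the order of a gigabyte in memory)
library(data.table)
dt <- data.table(id = 1:1e8, value = rnorm(1e8))
# This call runs for a very long time and uses a lot of RAM
write.csv(dt, "out.csv", row.names = FALSE)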
UPDATE 2019.01.07: fwrite has been on CRAN since 2016-11-25.
install.packages("data.table")
UPDATE 08.04.2016: fwrite has recently been added to the data.table package's development version. It also runs in parallel (implicitly).
# Install development version of data.table
install.packages("data.table",
repos = "https://Rdatatable.github.io/data.table", type = "source")
# Load package
library(data.table)
# Load data
data(USArrests)
# Write CSV
fwrite(USArrests, "USArrests_fwrite.csv")
According to the detailed benchmark tests shown under "Speeding up the performance of write.table", fwrite is ~17x faster than write.csv there (YMMV).
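If you want to check the speed difference on your own machine, a minimal benchmark sketch with a synthetic data.table (this is not the benchmark from the linked answer):
library(data.table)
# Synthetic example: 1 million rows, 10 numeric columns
dt <- as.data.table(replicate(10, rnorm(1e6)))
# Compare wall-clock times of base write.csv and fwrite
system.time(write.csv(dt, "test_writecsv.csv", row.names = FALSE))
system.time(fwrite(dt, "test_fwrite.csv"))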
UPDATE 15.12.2015: In the future there might be a fwrite function in the data.table package, see https://github.com/Rdatatable/data.table/issues/580.
In that thread a Gist is linked which provides a prototype for such a function that speeds up the process by a factor of 2 (according to the author): https://gist.github.com/oseiskar/15c4a3fd9b6ec5856c89
ORIGINAL ANSWER:
I had the same problem (trying to write even larger CSV files) and finally decided against using CSV files altogether.
I would recommend using SQLite instead, as it is much faster than dealing with CSV files:
require("RSQLite")
# Set up database
drv <- dbDriver("SQLite")
con <- dbConnect(drv, dbname = "test.db")
# Load example data
data(USArrests)
# Write data "USArrests" in table "USArrests" in database "test.db"
dbWriteTable(con, "arrests", USArrests)
# Test if the data was correctly stored in the database, i.e.
# run an exemplary query on the newly created database
dbGetQuery(con, "SELECT * FROM arrests WHERE Murder > 10")
# row_names Murder Assault UrbanPop Rape
# 1 Alabama 13.2 236 58 21.2
# 2 Florida 15.4 335 80 31.9
# 3 Georgia 17.4 211 60 25.8
# 4 Illinois 10.4 249 83 24.0
# 5 Louisiana 15.4 249 66 22.2
# 6 Maryland 11.3 300 67 27.8
# 7 Michigan 12.1 255 74 35.1
# 8 Mississippi 16.1 259 44 17.1
# 9 Nevada 12.2 252 81 46.0
# 10 New Mexico 11.4 285 70 32.1
# 11 New York 11.1 254 86 26.1
# 12 North Carolina 13.0 337 45 16.1
# 13 South Carolina 14.4 279 48 22.5
# 14 Tennessee 13.2 188 59 26.9
# 15 Texas 12.7 201 80 25.5
# Close the connection to the database
dbDisconnect(con)
For further information, see http://cran.r-project.org/web/packages/RSQLite/RSQLite.pdf
You can also use software like http://sqliteadmin.orbmu2k.de/ to access the database and export it to CSV etc.
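If you do end up needing a CSV after all, a minimal sketch is to read the table back into R and write it with fwrite (this reuses the test.db and arrests table from above; adjust names to your own data):
library(RSQLite)
library(data.table)
# Reconnect to the database created above
con <- dbConnect(SQLite(), dbname = "test.db")
# Read the whole table back into a data.frame
arrests <- dbReadTable(con, "arrests")
# Export it to CSV with fwrite
fwrite(arrests, "arrests_export.csv")
# Close the connection again
dbDisconnect(con)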