How to read part of the data from very large files?
The sample data is generated as:
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
head(df)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X1.1 X2.1 X3.1 X4.1 X5.1 X6.1 X7.1 X8.1 X9.1 X10.1
# 1 575 1843 1854 883 592 1362 1075 210 1526 1365 Qk8NP Xvw9z OYRa1 8BGIV bejiv CCoIE XDKJN HR7zc 2kKNY 1I5h8
# 2 1577 390 1861 912 277 636 758 1461 1978 1865 ZaHFl QLsli E7lbs YGq8u DgUAW c6JQ0 RAZFn Sc0Zt mif8I 3Ys6U
# 3 818 1076 147 1221 257 1115 759 1959 1088 1292 jM5Uw ctM3y 0HiXR hjOHK BZDOP ULQWm Ei8qS BVneZ rkKNL 728gf
# 4 1766 884 1331 1144 1260 768 1620 1231 1428 1193 r4ZCI eCymC 19SwO Ht1O0 repPw YdlSW NRgfL RX4ta iAtVn Hzm0q
# 5 1881 1851 1324 1930 1584 1318 940 1796 830 15 w8d1B qK1b0 CeB8u SlNll DxndB vaufY ZtlEM tDa0o SEMUX V7tLQ
# 6 91 264 1563 414 914 1507 1935 1970 287 409 gsY1u FxIgu 2XqS4 8kreA ymngX h0hkK reIsn tKgQY ssR7g W3v6c
saveRDS is used to save the file.
saveRDS(df, 'df.rds')
The file size is checked using the commands below:
file.info('df.rds')$size
# [1] 29935125
utils:::format.object_size(29935125, "auto")
# [1] "28.5 Mb"
The saved file is read using the function below:
readRDS('df.rds')
However, some of my files are several GB in size, and I need only a few columns for certain processing. Is it possible to read selected columns from RDS files?
Note: I already have RDS files, generated after considerably large amounts of processing. Now, I want to know the best possible way to read selected columns from the existing RDS files.
Method 1: Using the read.table() function. To import only selected columns of a CSV file, call read.table(), a built-in R function, and specify through its arguments which columns to import.
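As a minimal sketch of the read.table() approach for CSV files (not RDS files), columns can be skipped by setting their entry in colClasses to the string "NULL". The toy data frame, its column names, and the temporary file path are made up for illustration.

```r
# Write a small illustrative CSV file (hypothetical data and path)
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(id = 1:3, score = c(2.5, 3.5, 4.5), label = c("a", "b", "c"))
write.csv(toy, csv_path, row.names = FALSE)

# The string "NULL" in colClasses tells read.table() to skip that column,
# so only 'id' and 'label' are kept in the result
wanted <- read.table(csv_path, header = TRUE, sep = ",",
                     colClasses = c("integer", "NULL", "character"))
```

Note that colClasses is positional here, so the column order of the file has to be known in advance.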
R has its own data file format, usually saved with the .rds extension. To read an R data file, call the readRDS() function. As with a CSV file, you can load an RDS file straight from a website; however, you must first run the file through a decompressor before attempting to load it via readRDS().
To pick out single or multiple columns, use the select() function. The select() function expects a data frame as its first input ('argument', in R language), followed by the names of the columns you want to extract, with a comma between each name.
I don't think you can read only a portion of an rds or rda file.
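Given that limitation, one possible workaround for existing RDS files is to pay the full read cost once, keep only the needed columns, and save that smaller subset for later sessions. The sketch below uses a small made-up data frame and temporary paths in place of the real files; the whole object is still loaded briefly by readRDS(), so this does not reduce peak memory during the first read.

```r
# Stand-in for an existing large RDS file (hypothetical toy data and path)
rds_path <- tempfile(fileext = ".rds")
toy <- data.frame(X1 = 1:5, X2 = 6:10, X1.1 = letters[1:5])
saveRDS(toy, rds_path)

# Read the full object once, but keep only the columns of interest;
# the full data frame is never bound to a name, so it can be garbage collected
wanted_cols <- c("X1", "X1.1")
subset_df <- readRDS(rds_path)[, wanted_cols, drop = FALSE]
invisible(gc())

# Save the reduced data so later sessions can load the small file instead
subset_path <- tempfile(fileext = ".rds")
saveRDS(subset_df, subset_path)
```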
An alternative would be to use feather. As an example, using a large-ish feather I'm working with:
library(feather)
file.info("../feathers/C1.feather")["size"]
# size
# ../feathers/C1.feather 498782328
system.time( c1whole <- read_feather("../feathers/C1.feather") )
# user system elapsed
# 0.860 0.856 5.540
system.time( c1dyn <- feather("../feathers/C1.feather") )
# user system elapsed
# 0 0 0
ls.objects()  # not base R; presumably a user-defined helper that lists objects with their sizes
# Type Size PrettySize Dim
# c1dyn feather 3232 3.2 Kb 2886147 x 36
# c1whole tbl_df 554158688 528.5 Mb 2886147 x 36
You can interact with both variables as full data.frames: c1whole is already in memory (so may be a little faster), but accessing c1dyn is still quite speedy.
NB: some functions (e.g., several within dplyr) do not work on feather as they do on data.frame or tbl_df. If your intent is solely to pick-and-choose specific columns, then you'll be fine.
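For picking specific columns, a minimal sketch could look like the following. It assumes the installed feather package provides write_feather() and the columns argument of read_feather(); the toy data frame and the temporary file path are made up. For existing RDS files this implies a one-time conversion step (readRDS() followed by write_feather()).

```r
library(feather)

# One-time conversion: write a data frame to a feather file
# (a toy data frame stands in for the result of readRDS() on a real file)
feather_path <- tempfile(fileext = ".feather")
toy <- data.frame(X1 = 1:5, X2 = 6:10, X1.1 = letters[1:5],
                  stringsAsFactors = FALSE)
write_feather(toy, feather_path)

# Later: load only the selected columns instead of the whole table
picked <- read_feather(feather_path, columns = c("X1", "X1.1"))
```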
Another common way to store tabular/matrix/dataframe data on your hard drive is an SQLite database. This also allows the use of standard SQL commands or dplyr to interrogate the data. Just be warned that SQLite does not have a date format, so any dates need to be converted to character before writing them to the database.
set.seed(123)
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
replicate(10, stringi::stri_rand_strings(1000, 5)))
library(RSQLite)
conn <- dbConnect(RSQLite::SQLite(), dbname="myDB")
dbWriteTable(conn,"mytable",df)
alltables <- dbListTables(conn)
# Use sql queries to query data...
oneColumn <- dbGetQuery(conn,"SELECT X1 FROM mytable")
library(dplyr)
library(dbplyr)
my_db <- tbl(conn, "mytable")
my_db
# Use dplyr functions to query data...
my_db %>% select(X1)
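The date caveat mentioned above can be handled roughly as follows. This is only a sketch with a made-up table name, column names, and temporary database path: dates are converted to character before writing, a few columns are selected with SQL, and the character column is converted back to Date after reading.

```r
library(DBI)

# Hypothetical on-disk database in a temporary location
db_path <- tempfile(fileext = ".sqlite")
conn <- dbConnect(RSQLite::SQLite(), dbname = db_path)

# Toy data with a Date column, stored as character because SQLite has no date type
toy <- data.frame(
  id = 1:3,
  value = c(10.5, 20.1, 30.7),
  day = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"))
)
toy$day <- as.character(toy$day)
dbWriteTable(conn, "toy", toy)

# Read only the columns that are needed, then restore the Date class
two_cols <- dbGetQuery(conn, "SELECT id, day FROM toy")
two_cols$day <- as.Date(two_cols$day)

# Close the connection when finished
dbDisconnect(conn)
```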