I am trying to understand how to use the package mmap to access large csv files. More precisely, I'd like to:
1. create an mmap object from a csv file with mmap.csv();
2. save the file generated by mmap.csv() containing the data in binary format;
3. reload the mmap object from that binary file with mmap().
Achieving 1. and 2. is easy: just use mmap.csv() and save the tempfile() that contains the binary data, or modify mmap.csv() to accept an extra parameter as output file (and modify the line tmpstruct <- tempfile() accordingly).
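A minimal sketch of the first option, assuming (as the str() output further down suggests) that the path of the backing binary file is stored as the name of m$filedesc. The helper keep_binary() and the file names in the usage comment are hypothetical, not part of mmap; the helper only copies that file to a location of your choosing.

```r
# Hypothetical helper (not part of mmap): copy the binary file that backs
# an mmap object to a permanent location.
# Assumption: the file path is stored as the name of m$filedesc.
keep_binary <- function(m, dest) {
  src <- attr(m$filedesc, "names")
  if (!file.copy(src, dest, overwrite = TRUE)) {
    stop("could not copy ", src, " to ", dest)
  }
  invisible(dest)
}

# Usage with a real mmap object (requires the mmap package; the file
# names here are placeholders):
#   m <- mmap.csv("big.csv")
#   keep_binary(m, "big.Rmap")
```

This avoids patching mmap.csv() itself, at the cost of one extra copy of the binary data.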
What I am having trouble with is 3. In particular, I need to construct a C-struct for the records in the binary data from the mmap object.
Here is a simple reproducible example:
# create mmap object with its file
library(mmap)
data(cars)
m <- as.mmap(cars, file="cars.Rmap")
colnames(m) <- colnames(cars)
str(m)
munmap(m)
The information from str() can be used to construct the C-struct record.struct that allows mapping the binary file cars.Rmap via the function mmap().
> str(m)
<mmap:temp.Rmap> (struct) struct [1:50, 1:2] 4 ...
data :<externalptr>
bytes : num 400
filedesc : Named int 27
- attr(*, "names")= chr "temp.Rmap"
storage.mode :List of 2
$ speed:Classes 'Ctype', 'int' atomic (0)
.. ..- attr(*, "bytes")= int 4
.. ..- attr(*, "signed")= int 1
$ dist :Classes 'Ctype', 'int' atomic (0)
.. ..- attr(*, "bytes")= int 4
.. ..- attr(*, "signed")= int 1
- attr(*, "bytes")= int 8
- attr(*, "offset")= int [1:2] 0 4
- attr(*, "signed")= logi NA
- attr(*, "class")= chr [1:2] "Ctype" "struct"
pagesize : num 4096
dim :NULL
In this case, we need two 4-byte integers:
# load from disk
record.struct <- struct(speed = integer(), # int32(), 4 byte int
dist = integer() # int32(), 4 byte int
)
m <- mmap("cars.Rmap", mode=record.struct)
Inferring the right C-struct can be very impractical for "wide" csv files (i.e. files with tens or hundreds of columns). Here is my question:
How can one construct record.struct directly from the mmap object m?
Here is a more or less complete example of what you are asking, using mmap and mmap.csv:
data(mtcars)
tmp <- tempfile()
write.csv(mtcars, tmp)
m <- mmap.csv(tmp) # mmap in the csv
head(m)
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
st <- m$storage.mode
## since m is already mmap'd as a binary, we'll use that here - but you'd store this
m1 <- mmap(attr(m$filedesc, "names"), mode=st, extractFUN=as.data.frame)
head(m1)
X mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
As a previous answer mentions, m$storage.mode is the mode you need.
You could go one step further and store the mode in a file, using some naming convention of your own devising. You could also create a custom binary object using the len and off arguments to mmap.
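One way to store the mode next to the data is sketched below. It is only an illustration: the ".mode.rds" suffix and the helpers mode_path(), save_mode() and load_mode() are hypothetical names, and the sketch assumes the storage mode is an ordinary R object that survives saveRDS()/readRDS(). The mmap part reuses the calls from the example above and runs only if the package is installed.

```r
# Hypothetical naming convention: "<binary file>.mode.rds" sits next to the data.
mode_path <- function(data_file) paste0(data_file, ".mode.rds")

# Serialize the storage mode (the struct describing one record).
save_mode <- function(mode, data_file) {
  saveRDS(mode, mode_path(data_file))
  invisible(mode_path(data_file))
}

# Read the storage mode back for a given binary file.
load_mode <- function(data_file) readRDS(mode_path(data_file))

if (requireNamespace("mmap", quietly = TRUE)) {
  library(mmap)
  tmp <- tempfile()
  write.csv(mtcars, tmp)
  m <- mmap.csv(tmp)                      # mmap in the csv
  bin <- attr(m$filedesc, "names")        # path of the binary file
  save_mode(m$storage.mode, bin)          # store the mode alongside it
  # Later: map the binary file again using only what is on disk.
  m1 <- mmap(bin, mode = load_mode(bin), extractFUN = as.data.frame)
  print(head(m1))
}
```

With this convention, a new session only needs the path of the binary file to recover both the data and its record layout.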
This should work:
varClasses <- rapply(m$storage.mode, typeof)
Here's what I get:
> rapply(m$storage.mode, typeof)
speed dist
"double" "double"
(This is due to cars being stored as doubles in my version of R. Results match yours when the type is changed to integers - see Update 1, below.)
Using this to create the struct object is simply a matter of replacing these R types with the appropriate C types (e.g. changing integer to int32), which can be done via a list lookup; you could then use paste to create the appropriate list of arguments.
Here's what m looks like for me, using the same commands as you gave:
> str(m)
<mmap:/tmp/Rtmpz...> (struct) struct [1:50, 1:2] 4 ...
data :<externalptr>
bytes : num 800
filedesc : Named int 3
- attr(*, "names")= chr "/tmp/RtmpzGwIDT/file77aa9d47"
storage.mode :List of 2
$ speed:Classes 'Ctype', 'double' atomic (0)
.. ..- attr(*, "bytes")= int 8
.. ..- attr(*, "signed")= int 1
$ dist :Classes 'Ctype', 'double' atomic (0)
.. ..- attr(*, "bytes")= int 8
.. ..- attr(*, "signed")= int 1
- attr(*, "bytes")= int 16
- attr(*, "offset")= int [1:2] 0 8
- attr(*, "signed")= logi NA
- attr(*, "class")= chr [1:2] "Ctype" "struct"
pagesize : num 4096
dim :NULL
Update 1: When I explicitly converted cars to integers, and ensured the object was a data frame (i.e. cars2 <- as.data.frame(apply(cars, 2, as.integer)); colnames(cars2) = colnames(cars)), everything works out, and the rapply produces "integer", as expected.
Update 2: Here's a hack at creating the internal arguments to pass to struct():
oTypes = rapply(m$storage.mode, typeof)
lNames = names(oTypes)
lTypes = as.character(oTypes)
lTypes = paste(lTypes,'()', sep = "")
lArgs = paste(lNames, lTypes, sep = "=", collapse = ",")
It's an approximation, because I suspect that lTypes needs to be converted from R to C types.
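The conversion step could be sketched as a lookup keyed on both the R type and the "bytes" attribute shown by str(), since typeof() alone cannot tell, say, a 2-byte from a 4-byte integer. This is an illustration, not a tested recipe: field_keys(), ctype_lookup and struct_call_text() are hypothetical names, the lookup covers only the two cases that appear in this thread (4-byte integers and 8-byte doubles), and fake_mode is a hand-built stand-in for m$storage.mode.

```r
# Hypothetical helper: describe each field of a storage mode as "<typeof>:<bytes>".
field_keys <- function(mode) {
  vapply(mode, function(f) paste(typeof(f), attr(f, "bytes"), sep = ":"),
         character(1))
}

# Assumed lookup from those keys to mmap constructor names; extend as needed.
ctype_lookup <- c("integer:4" = "int32", "double:8" = "real64")

# Build the text of a struct() call from a storage mode.
struct_call_text <- function(mode) {
  keys <- field_keys(mode)
  ctor <- ctype_lookup[keys]
  if (anyNA(ctor)) {
    stop("no constructor known for: ",
         paste(unique(keys[is.na(ctor)]), collapse = ", "))
  }
  paste0("struct(", paste0(names(mode), "=", ctor, "()", collapse = ","), ")")
}

# Stand-in for m$storage.mode, mimicking the str() output in the question.
fake_mode <- list(
  speed = structure(integer(0), bytes = 4L, signed = 1L, class = c("Ctype", "int")),
  dist  = structure(integer(0), bytes = 4L, signed = 1L, class = c("Ctype", "int"))
)
call_text <- struct_call_text(fake_mode)

# With mmap installed, the text can be evaluated into an actual struct.
if (requireNamespace("mmap", quietly = TRUE)) {
  record.struct <- eval(parse(text = call_text), envir = asNamespace("mmap"))
}
```

Building a string and evaluating it follows the paste-based approach above; it assumes the column names are syntactically valid R names.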