mmap and csv files

Tags: r, mmap

I am trying to understand how to use the package mmap to access large csv files. More precisely, I'd like to

  1. Create a mmap object from a csv file with mmap.csv();
  2. Save the file created by mmap.csv() containing the data in binary format;
  3. Be able to "map the binary data back to R" using the function mmap().

Achieving 1. and 2. is easy: just use mmap.csv() and save the tempfile() that contains the binary data, or modify mmap.csv() to accept an extra parameter naming the output file (and modify the line tmpstruct <- tempfile() accordingly). What I am having trouble with is 3. In particular, I need to construct a C-struct for the records in the binary data from the mmap object. Here is a simple reproducible example:

# create mmap object with its file
library(mmap)
data(cars)

m <- as.mmap(cars, file="cars.Rmap")
colnames(m) <- colnames(cars)
str(m) 
munmap(m)

The information from str() can be used to construct the C-struct record.struct that allows mapping the binary file cars.Rmap via the function mmap.

> str(m)
<mmap:temp.Rmap>  (struct) struct [1:50, 1:2] 4 ...
  data         :<externalptr> 
  bytes        : num 400
  filedesc     : Named int 27
 - attr(*, "names")= chr "temp.Rmap"
  storage.mode :List of 2
 $ speed:Classes 'Ctype', 'int'  atomic (0) 
  .. ..- attr(*, "bytes")= int 4
  .. ..- attr(*, "signed")= int 1
 $ dist :Classes 'Ctype', 'int'  atomic (0) 
  .. ..- attr(*, "bytes")= int 4
  .. ..- attr(*, "signed")= int 1
 - attr(*, "bytes")= int 8
 - attr(*, "offset")= int [1:2] 0 4
 - attr(*, "signed")= logi NA
 - attr(*, "class")= chr [1:2] "Ctype" "struct"
  pagesize     : num 4096
  dim          :NULL

In this case, we need two 4-byte integers:

# load from disk
record.struct <- struct(speed = integer(),  # int32(), 4 byte int
                        dist  = integer()   # int32(), 4 byte int
                        )
m <- mmap("cars.Rmap", mode=record.struct)

Inferring the right C-struct can be very impractical for "wide" csv files (i.e. files with tens or hundreds of columns). Here is my question: How can one construct record.struct directly from the mmap object m?

Ryogi asked Nov 04 '11 05:11


2 Answers

A more or less complete example of what you are asking for, using mmap and mmap.csv:

data(mtcars)
tmp <- tempfile()
write.csv(mtcars, tmp)
m <- mmap.csv(tmp)   # mmap in the csv
head(m)
                    X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 Mazda RX4           21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag       21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 Datsun 710          22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout   18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1


st <- m$storage.mode

## m is already backed by a binary file, so we reuse that file here;
## in practice you would keep that file and the mode (st) for later use
m1 <- mmap(attr(m$filedesc, "names"), mode=st, extractFUN=as.data.frame)

head(m1)
                    X  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 Mazda RX4           21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag       21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 Datsun 710          22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 Hornet 4 Drive      21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 Hornet Sportabout   18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 Valiant             18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

As a previous answer mentions, m$storage.mode is the mode you need.

You could go one step further and store the mode in a file using some naming convention of your devising. You could also create a custom binary object utilizing the len and off args to mmap.
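One way the "store the mode in a file" idea might look is sketched below. This is only an illustration: the file names (mtcars.bin, mtcars.mode.rds) and the naming convention are hypothetical, and it assumes the Ctype struct in m$storage.mode survives a saveRDS()/readRDS() round trip and that the binary file behind the mapping can simply be copied.

```r
library(mmap)

# Hypothetical naming convention: "<name>.bin" holds the binary records,
# "<name>.mode.rds" holds the serialized storage mode (the struct).
bin_file  <- file.path(tempdir(), "mtcars.bin")
mode_file <- file.path(tempdir(), "mtcars.mode.rds")

# Build the mmap object from a csv, as in the example above.
csv <- tempfile(fileext = ".csv")
write.csv(mtcars, csv)
m <- mmap.csv(csv)

# Keep a copy of the binary file and of the struct describing each record.
file.copy(attr(m$filedesc, "names"), bin_file, overwrite = TRUE)
saveRDS(m$storage.mode, mode_file)
munmap(m)

# Later, possibly in a new session: rebuild the mapping from the two files.
st <- readRDS(mode_file)
m2 <- mmap(bin_file, mode = st, extractFUN = as.data.frame)
head(m2)
munmap(m2)
```

With this layout nothing has to be inferred by hand for wide csv files: the struct is read back exactly as mmap.csv() created it.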

Jeff R answered Nov 11 '22 23:11


This should work:

varClasses <- rapply(m$storage.mode, typeof)

Here's what I get:

> rapply(m$storage.mode, typeof)
    speed     dist
 "double" "double" 

(This is due to cars being stored as doubles in my version of R. Results match yours when the type is changed to integers - see Update 1, below.)

Using this to create the struct object is simply a matter of replacing these types with the appropriate C types (e.g. changing int to integer), which can be done via a list lookup, and then you could use paste to create the appropriate list of arguments.


Here's what m looks like for me, using the same commands as you gave:

> str(m)
<mmap:/tmp/Rtmpz...>  (struct) struct [1:50, 1:2] 4 ...
  data         :<externalptr> 
  bytes        : num 800
  filedesc     : Named int 3
 - attr(*, "names")= chr "/tmp/RtmpzGwIDT/file77aa9d47"
  storage.mode :List of 2
 $ speed:Classes 'Ctype', 'double'  atomic (0) 
  .. ..- attr(*, "bytes")= int 8
  .. ..- attr(*, "signed")= int 1
 $ dist :Classes 'Ctype', 'double'  atomic (0) 
  .. ..- attr(*, "bytes")= int 8
  .. ..- attr(*, "signed")= int 1
 - attr(*, "bytes")= int 16
 - attr(*, "offset")= int [1:2] 0 8
 - attr(*, "signed")= logi NA
 - attr(*, "class")= chr [1:2] "Ctype" "struct"
  pagesize     : num 4096
  dim          :NULL

Update 1: When I explicitly converted cars to integers, and ensured the object was a data frame (i.e. cars2 <- as.data.frame(apply(cars, 2, as.integer)); colnames(cars2) = colnames(cars)), everything works out, and the rapply produces "integer", as expected.

Update 2: Here's a hack at creating the internal arguments to pass to struct():

oTypes  = rapply(m$storage.mode, typeof)
lNames  = names(oTypes)
lTypes  = as.character(oTypes)
lTypes  = paste(lTypes,'()', sep = "")
lArgs   = paste(lNames, lTypes, sep = "=", collapse = ",")

It's an approximation, because I suspect that lTypes needs to be converted from R to C types.
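A rough sketch of that R-to-C conversion, using a list lookup as suggested above. The lookup table ctypeFor is an assumption made for illustration (it maps only "integer" and "double" to the mmap constructors int32() and real64()), and oTypes is hard-coded here as a stand-in for the result of rapply(m$storage.mode, typeof) so the snippet runs without a mapped file.

```r
# Stand-in for rapply(m$storage.mode, typeof); values assumed for illustration.
oTypes <- c(speed = "integer", dist = "double")

# Assumed lookup from R type names to mmap Ctype constructor names.
ctypeFor <- c(integer = "int32", double = "real64")

# Build "name = ctype()" pairs and wrap them in a call to struct().
lTypes <- paste0(ctypeFor[oTypes], "()")
lArgs  <- paste(names(oTypes), "=", lTypes, collapse = ", ")
structCall <- paste0("struct(", lArgs, ")")
structCall

# With mmap loaded, the string could then be evaluated:
# record.struct <- eval(parse(text = structCall))
```

Note that typeof() reports only the R type, not the byte width stored in the "bytes" attribute, so a column held in a narrower C type would likely be mapped to the wrong width by this lookup. Reusing m$storage.mode directly avoids that problem.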

Iterator answered Nov 11 '22 23:11