
Read binary file with R

Tags: r

I am looking for help to read a binary file with R.

I know the file can be successfully imported in Python with the following code (np for numpy):

import numpy as np

dt = np.dtype([('var1', np.uint32), ('var2', np.uint16), ('var3', np.int16),
               ('var4', np.int16), ('var5', np.int16)])
data = np.fromfile('filename.DAT', dtype=dt)

I, however, don't understand how to use readBin to import this file in R. Any help would be appreciated.

Jehol asked Sep 03 '25 14:09



1 Answer

There may well be a pre-existing solution to this problem using the reticulate or RcppCNPy packages. However, I thought it might be educational to show how you could do this in base R.

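When you read arbitrary binary data into R using readBin with what = "raw", you get a "raw" vector. This is a vector of the individual bytes in the file. The third argument is the maximum number of bytes to read, so it needs to be at least as large as the file. So you could do: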

my_data <- readBin("filename.DAT", "raw", 10e6)
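If you would rather not guess an upper bound, one option is to ask the file system for the size first. This is only a sketch: the temporary file and its 24 arbitrary bytes are made up here so the snippet runs on its own, and you would use your real path instead.

```r
# Hypothetical demo file: 24 arbitrary bytes written to a temporary path
path <- tempfile(fileext = ".DAT")
writeBin(as.raw(0:23), path)

# Request exactly as many bytes as the file holds
n_bytes <- file.size(path)
my_data <- readBin(path, what = "raw", n = n_bytes)

length(my_data)  # 24
```

Reading the whole file this way avoids silently dropping the tail of a file that is bigger than the hard-coded limit.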

So it's easy to get the data into R. The difficult part is interpreting it.

As far as I can tell from the numpy docs, a structured dtype like yours is written as a contiguous block of bytes with no padding between fields, in the machine's native byte order, which is little-endian on most current hardware. So in your file, the first 4 bytes of each record should represent a 32-bit unsigned integer, the next 2 bytes a 16-bit unsigned integer, and the next 6 bytes three signed 16-bit integers. This pattern then repeats every 12 bytes until the end of the file.
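The layout just described can be written down as a small table of field widths, which also gives you a cheap sanity check on the file length. This is a sketch under the assumption that the records really are packed with no padding; field_bytes, offsets and n_records are names I made up for illustration.

```r
# Width in bytes of each field, in file order (assumes a packed layout)
field_bytes <- c(var1 = 4, var2 = 2, var3 = 2, var4 = 2, var5 = 2)

record_size <- sum(field_bytes)                    # 12 bytes per record
offsets     <- cumsum(field_bytes) - field_bytes + 1
offsets                                            # first byte of each field: 1 5 7 9 11

# Number of whole records in a raw vector; stops if bytes are left over
n_records <- function(bytes) {
  stopifnot(length(bytes) %% record_size == 0)
  length(bytes) / record_size
}

n_records(raw(60))  # 5
```

If the length check fails on your real file, the file probably has a header or a different record layout, and the byte arithmetic that follows would not apply as-is.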

This is not a format used in R, so it takes a bit of work to get the data back. Let's say you have read in your data and it looks like this:

my_data
#  [1] 44 5f 93 e8 34 e6 f1 a9 a1 10 35 2e b0 62 c5 7f b7 fd 61 c7 ef 37 a7 21 45 63
# [27] 04 62 de 57 7b 99 7e 30 d3 ab cb 1c b9 69 d2 a6 c8 8e 88 ca 06 7a bb b1 7a dc
# [53] 70 3f 13 1a 51 85 a9 68

If you want to see what the bytes look like in terms of the rows of data in your table, you could do this:

t(matrix(my_data, nrow = 12))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# [1,]   44   5f   93   e8   34   e6   f1   a9   a1    10    35    2e
# [2,]   b0   62   c5   7f   b7   fd   61   c7   ef    37    a7    21
# [3,]   45   63   04   62   de   57   7b   99   7e    30    d3    ab
# [4,]   cb   1c   b9   69   d2   a6   c8   8e   88    ca    06    7a
# [5,]   bb   b1   7a   dc   70   3f   13   1a   51    85    a9    68

What this means is that your binary data should be interpreted this way:

#  <-----var1--------> <-var2--> <-var3--> <-var4--> <-var5->
#  44   5f   93   e8  | 34   e6 | f1   a9 | a1   10 | 35   2e  <- row 1
#  b0   62   c5   7f  | b7   fd | 61   c7 | ef   37 | a7   21  <- row 2
#  45   63   04   62  | de   57 | 7b   99 | 7e   30 | d3   ab  <- row 3
#  cb   1c   b9   69  | d2   a6 | c8   8e | 88   ca | 06   7a  <- row 4
#  bb   b1   7a   dc  | 70   3f | 13   1a | 51   85 | a9   68  <- row 5
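To make the little-endian reading concrete, here is row 1 decoded by hand. The first byte of each field is the least significant one, so byte k (counting from 0) is weighted by 256^k. This is just a worked example on the bytes shown above.

```r
# The 12 bytes of row 1 from the table above
row1 <- as.raw(c(0x44, 0x5f, 0x93, 0xe8, 0x34, 0xe6,
                 0xf1, 0xa9, 0xa1, 0x10, 0x35, 0x2e))
b <- as.numeric(row1)

# var1: bytes 1-4, least significant byte first
var1 <- sum(b[1:4] * 256^(0:3))
var1  # 3901972292

# var2: bytes 5-6, least significant byte first
var2 <- sum(b[5:6] * 256^(0:1))
var2  # 58932
```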

So if we first create a data frame from this matrix:

df <- as.data.frame(t(matrix(as.numeric(my_data), nrow = 12)))

We can now recreate our variables from the known structure of the file:

# Make our 32-bit numbers
var1 <- df$V1 + 2^8 * df$V2 + 2^16 * df$V3 + 2^24 * df$V4

# Make our 16-bit numbers
var2 <- df$V5  + 2^8 * df$V6
var3 <- df$V7  + 2^8 * df$V8
var4 <- df$V9  + 2^8 * df$V10
var5 <- df$V11 + 2^8 * df$V12

# Interpret our var3, 4 and 5 as signed rather than unsigned
var3 <- ifelse(var3 < 2^15, var3, var3 - 2^16)
var4 <- ifelse(var4 < 2^15, var4, var4 - 2^16)
var5 <- ifelse(var5 < 2^15, var5, var5 - 2^16)

# Store as a data frame
df <- data.frame(var1 = var1, var2 = var2, var3 = var3, var4 = var4, var5 = var5)
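As a cross-check on the sign step, readBin can also decode a raw vector directly, and it handles 16-bit signed and unsigned integers itself. The sketch below compares the manual rule against readBin for the var3 bytes of row 1 (f1 a9); the variable names are only for illustration.

```r
# The two var3 bytes of row 1, low byte first
pair <- as.raw(c(0xf1, 0xa9))

# Let readBin interpret them as unsigned and as signed 16-bit values
as_unsigned <- readBin(pair, "integer", size = 2, signed = FALSE, endian = "little")
as_signed   <- readBin(pair, "integer", size = 2, signed = TRUE,  endian = "little")

# The manual rule: values of 2^15 or more wrap around to negatives
manual <- ifelse(as_unsigned < 2^15, as_unsigned, as_unsigned - 2^16)

as_unsigned  # 43505
as_signed    # -22031
manual       # -22031
```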

This means we get the following interpretation of our byte data:

df
#>         var1  var2   var3   var4   var5
#> 1 3901972292 58932 -22031   4257  11829
#> 2 2143642288 64951 -14495  14319   8615
#> 3 1644454725 22494 -26245  12414 -21549
#> 4 1773739211 42706 -28984 -13688  31238
#> 5 3699028411 16240   6675 -31407  26793

So, assuming your data is in EXACTLY the format you specified, the following function should extract it as a data frame:

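read_numpy_data <- function(path, max_file_size = 10e6)
{
  # readBin stops after max_file_size bytes, so raise this for bigger files
  my_data <- readBin(path, "raw", max_file_size)
  # Each record is 12 bytes; anything else means the format assumption is wrong
  stopifnot(length(my_data) %% 12 == 0)
  df      <- as.data.frame(t(matrix(as.numeric(my_data), nrow = 12)))
  # Combine a low and a high byte, then reinterpret the result as signed 16-bit
  as_sign <- function(x, y) {
    z <- x + 2^8 * y
    ifelse(z < 2^15, z, z - 2^16)
  }
  data.frame(var1 = df$V1 + 2^8 * df$V2 + 2^16 * df$V3 + 2^24 * df$V4,
             var2 = df$V5  + 2^8 * df$V6,
             var3 = as_sign(df$V7,  df$V8),
             var4 = as_sign(df$V9,  df$V10),
             var5 = as_sign(df$V11, df$V12))
}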
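If you prefer to let readBin do the integer decoding instead of the manual arithmetic, something like the following should give the same result. Treat it as a sketch under the same format assumptions: read_numpy_records and int16 are names I invented. Note that readBin only supports unsigned reads for 1- and 2-byte integers, and R's own integers are signed 32-bit, so var1 is assembled as a double from two unsigned 16-bit halves.

```r
# Hypothetical alternative: decode each field with readBin() on raw bytes
read_numpy_records <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  stopifnot(length(bytes) %% 12 == 0)   # 12 bytes per record
  m <- matrix(bytes, nrow = 12)         # one column per record
  n <- ncol(m)

  # Decode a pair of byte rows as little-endian 16-bit integers
  int16 <- function(rows, signed) {
    readBin(as.vector(m[rows, ]), what = "integer", n = n,
            size = 2, signed = signed, endian = "little")
  }

  data.frame(
    # uint32 built as a double from its low and high 16-bit halves
    var1 = int16(1:2, FALSE) + 2^16 * int16(3:4, FALSE),
    var2 = int16(5:6, FALSE),
    var3 = int16(7:8, TRUE),
    var4 = int16(9:10, TRUE),
    var5 = int16(11:12, TRUE)
  )
}
```

You would call it as read_numpy_records("filename.DAT"). Because it sizes the read from file.size, there is no maximum file size to tune.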
Allan Cameron answered Sep 05 '25 04:09
