
Store Text Files as Binary for Faster Read/Write

I have a large set of text files that I need to process. Performance is pretty good now, thanks to pmap(), but I'm looking for an additional speed-up. The current bottleneck is parsing strings into floats.

My idea is to load the data (pipe-delimited text files) once and write it out in a binary format, which, from what I have seen, Julia should be able to load faster. The problem is that I don't know the proper way to load the binary data back into Julia after writing it.

Here's some example code I have for loading, parsing and writing to binary:

input_file = "/cool/input/file.dat"   ## -- pipe delimited text file of Floats
output_file = "/cool/input/data_binary" 

open(input_file, "r") do in_file
    open(output_file, "w") do out_file
        for line in eachline(in_file)
            split_line = split(line, '|')
            out_float = parse(Float64, split_line[4])
            write(out_file, out_float)
        end
    end
end

The problem is that when I load the above file into Julia, I have no idea what the values are:

read(output_file)
n-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x80
 0x16

How can I use these binary values as floats in my Julia code? More generally, does it make sense to convert my text file data to binary in this way if I'm looking for a performance increase?

Jeremy McNees asked Dec 25 '22 02:12



1 Answer

You need to use the reinterpret function:

help?> reinterpret
search: reinterpret

  reinterpret(type, A)

  Change the type-interpretation of a block of memory. For example,
  reinterpret(Float32, UInt32(7)) interprets the 4 bytes corresponding
  to UInt32(7) as a Float32. For arrays, this constructs an array with 
  the same binary data as the given array, but with the specified element
  type.
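As a minimal sketch of what that means for the raw bytes shown in the question: the snippet below assumes Julia 1.x, where reinterpret on an array returns a lazy view and collect copies it into a plain Vector. The values in vals are made up for illustration and are not from the original data.

```julia
# Write three Float64 values into an in-memory buffer, the same way
# write(io, x) would write them to a file.
io = IOBuffer()
vals = [1.5, -2.25, 1.0e10]   # illustrative values only
for v in vals
    write(io, v)              # each Float64 is written as 8 raw bytes
end

bytes = take!(io)             # Vector{UInt8}, like the result of read(output_file)

# View the same 24 bytes as Float64 values; collect copies the view
# into an ordinary Vector{Float64}.
floats = collect(reinterpret(Float64, bytes))
```

Here floats holds the same three numbers as vals, because the bytes were never changed, only viewed under a different element type.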

Function to write numeric data:

julia> function write_data(file_name::String, data::AbstractArray{T}) where {T<:Number}
           open(file_name, "w") do f_out
               # write each element as its raw in-memory bytes
               for i in data
                   write(f_out, i)
               end
           end
       end
write_data (generic function with 1 method)

Random data:

julia> data = rand(10)
10-element Array{Float64,1}:
 0.986948
 0.616107
 0.504965
 0.673264
 0.0358904
 0.1795
 0.399481
 0.233351
 0.320968
 0.16746

Write the random data to foo.bin:

julia> write_data("foo.bin", data)

Function that reads binary data and reinterprets it as a numeric data type:

julia> function read_data(file_name::String, dtype::Type{T}) where {T<:Number}
           open(file_name, "r") do f_in
               # read all bytes, then view them as elements of type dtype
               reinterpret(dtype, read(f_in))
           end
       end
read_data (generic function with 1 method)

Reading the sample data as Float64s yields the same array we wrote:

julia> read_data("foo.bin", Float64)
10-element Array{Float64,1}:
 0.986948
 0.616107
 0.504965
 0.673264
 0.0358904
 0.1795
 0.399481
 0.233351
 0.320968
 0.16746
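An alternative that skips the intermediate byte array is to read straight into a preallocated typed array with read!. The sketch below assumes Julia 1.x and a file containing nothing but raw values of a plain bits type such as Float64. The helper name read_typed and the temporary file are hypothetical and not part of the original answer.

```julia
# Hypothetical helper: read a whole file of raw values of type T.
# Assumes T is a plain bits type (e.g. Float64) and the file has no header.
function read_typed(file_name::AbstractString, ::Type{T}) where {T<:Number}
    n = div(filesize(file_name), sizeof(T))   # number of complete T values in the file
    out = Vector{T}(undef, n)
    open(file_name, "r") do f_in
        read!(f_in, out)                      # fill the array directly from the stream
    end
    return out
end

# Round trip through a temporary file.
path = tempname()
data = rand(10)
open(path, "w") do f_out
    write(f_out, data)                        # writes the array's raw bytes in one call
end
result = read_typed(path, Float64)
rm(path)
```

Because the result is an ordinary Vector{Float64}, it can be passed to code that expects a plain array without an extra copy.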

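Reinterpreting as Float32 naturally yields twice as many elements, because each 8-byte Float64 is split into two 4-byte values. The resulting numbers are meaningless, since the file was written as Float64: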

julia> read_data("foo.bin", Float32)
20-element Array{Float32,1}:
  1.4035f7
  1.87174
 -9.17366f25
  1.77903
 -1.03106f-24
  1.75124
  1.9495f-20
  1.79332
  2.88032f-21
  1.26856
  1.17736f19
  1.5545
 -3.25944f-18
  1.69974
  5.25285f-17
  1.60835
 -3.46489f14
  1.66048
  1.91915f-25
  1.54246
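The element counts follow directly from the element sizes. A small sketch of the arithmetic, assuming Julia 1.x (the variable names are illustrative):

```julia
x = rand(10)                      # 10 Float64 values, 8 bytes each
as32 = reinterpret(Float32, x)    # the same bytes viewed as 4-byte values

n_bytes = sizeof(x)               # total bytes: 10 * 8
n32 = length(as32)                # n_bytes divided by sizeof(Float32)
```

So the type passed when reading has to match the type that was written; the file itself carries no record of it.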
HarmonicaMuse answered Dec 29 '22 23:12
