I have a large set of text files that I need to process. The performance is pretty good now, thanks to pmap(), but I'm looking for an additional speed-up. The current bottleneck is parsing strings into floats.
I had the thought of loading my data (pipe-delimited text files) and writing it out in a binary format. From what I have seen, Julia should be able to load my data faster this way. The problem is that I'm having trouble loading the binary data back into Julia after writing it.
Here's some example code I have for loading, parsing and writing to binary:
input_file = "/cool/input/file.dat" ## -- pipe delimited text file of Floats
output_file = "/cool/input/data_binary"
open(input_file, "r") do in_file
    open(output_file, "w") do out_file
        for line in eachline(in_file)
            split_line = split(line, '|')
            out_float = parse(Float64, split_line[4])
            write(out_file, out_float)
        end
    end
end
The problem is that when I load the above file into Julia, I have no idea what the values are:
read(output_file)
n-element Array{UInt8,1}:
0x00
0x00
0x00
0x00
0x00
0x80
0x16
How can I use these binary values as floats in my Julia code? More generally, does it make sense to convert my text file data to binary in this way if I'm looking for a performance increase?
You need to use the reinterpret function:
help?> reinterpret
search: reinterpret
reinterpret(type, A)
Change the type-interpretation of a block of memory. For example,
reinterpret(Float32, UInt32(7)) interprets the 4 bytes corresponding
to UInt32(7) as a Float32. For arrays, this constructs an array with
the same binary data as the given array, but with the specified element
type.
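To make the help text concrete, here is a minimal sketch (assuming Julia 1.x; the variable names are only illustrative) that takes the 8 raw bytes of a single Float64 and views them as a Float64 again:

```julia
# Illustrative only: round-trip one Float64 through its raw bytes.
bytes = collect(reinterpret(UInt8, [1.5]))  # Vector{UInt8} holding the 8 bytes of 1.5
value = reinterpret(Float64, bytes)[1]      # view the same 8 bytes as a Float64

println(length(bytes))
println(value)
```

The byte vector plays the same role as the result of read(output_file) in the question: the bits are unchanged, only the element type used to interpret them differs.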
Function to write numeric data:
julia> function write_data(file_name::String, data::AbstractArray{T}) where {T<:Number}
           open(file_name, "w") do f_out
               for i in data
                   write(f_out, i)
               end
           end
       end
write_data (generic function with 1 method)
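The loop in write_data writes one element per call. As a possible variation, a plain Vector{Float64} can also be passed to write in a single call, which writes the raw bytes of all elements. The sketch below assumes Julia 1.x; write_data_bulk is a hypothetical name, not part of the original answer, and the temporary file is used only for demonstration.

```julia
# Hypothetical helper: write a whole Float64 vector with one write call.
# Assumes the element type is a plain bits type (here Float64).
function write_data_bulk(file_name::String, data::Vector{Float64})
    open(file_name, "w") do f_out
        write(f_out, data)  # returns the number of bytes written
    end
end

path = tempname()                                 # throwaway file for the demo
nbytes = write_data_bulk(path, [1.0, 2.0, 3.0])   # 3 values of 8 bytes each
println(nbytes)
```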
Random data:
julia> data = rand(10)
10-element Array{Float64,1}:
0.986948
0.616107
0.504965
0.673264
0.0358904
0.1795
0.399481
0.233351
0.320968
0.16746
After the sample data has been written with write_data("foo.bin", data), this function reads the binary file and reinterprets its bytes as a numeric data type:
julia> function read_data(file_name::String, dtype::Type{T}) where {T<:Number}
           open(file_name, "r") do f_in
               reinterpret(dtype, read(f_in))
           end
       end
read_data (generic function with 1 method)
Reading the sample data as Float64s yields the same array we wrote:
julia> read_data("foo.bin", Float64)
10-element Array{Float64,1}:
0.986948
0.616107
0.504965
0.673264
0.0358904
0.1795
0.399481
0.233351
0.320968
0.16746
Reinterpreting as Float32 naturally yields twice as many elements, since each 8-byte Float64 is read as two 4-byte values:
julia> read_data("foo.bin", Float32)
20-element Array{Float32,1}:
1.4035f7
1.87174
-9.17366f25
1.77903
-1.03106f-24
1.75124
1.9495f-20
1.79332
2.88032f-21
1.26856
1.17736f19
1.5545
-3.25944f-18
1.69974
5.25285f-17
1.60835
-3.46489f14
1.66048
1.91915f-25
1.54246
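As an alternative to reading bytes and reinterpreting them, the values can be read straight into a preallocated typed vector with read!. This is only a sketch under stated assumptions: Julia 1.x, a file that contains nothing but raw values of type T in the machine's native byte order, and a hypothetical helper name read_data_typed that does not appear in the original answer.

```julia
# Hypothetical helper: read a file of raw values directly into a Vector{T}.
function read_data_typed(file_name::String, ::Type{T}) where {T<:Number}
    n = filesize(file_name) ÷ sizeof(T)  # number of whole elements in the file
    out = Vector{T}(undef, n)            # uninitialized buffer of the right size
    open(file_name, "r") do f_in
        read!(f_in, out)                 # fill the buffer in place
    end
    return out
end

# Demo with a throwaway file containing three Float64 values.
path = tempname()
open(path, "w") do io
    write(io, [0.25, 0.5, 0.75])
end
floats = read_data_typed(path, Float64)
println(floats)
```

The element count is derived from the file size, so the element type passed in has to match the type that was written; otherwise the values are meaningless, as in the Float32 example.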