I have the following code:
cd(joinpath(homedir(),"Desktop"))
using HDF5
using JLD
# read contents of a file
t = readall("sourceFile")
# remove unnecessary characters
t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "")
# convert string into Float64 array (approximately ~140 columns)
data = readdlm(IOBuffer(t), ' ', char(10))
# save array on the hard drive
save("data.jld", "data", data)
This works fine when I test it with a sourceFile that has 10^4 lines or fewer. However, when sourceFile has around 5*10^6 lines, it fails at
t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "")
with the following message
This question is old and based on an older version of Julia, so it is worth checking whether the problem still occurs on a recent version. I tested this on Julia 0.5 (the latest release when I tested), and the code above seems to work correctly with 5*10^6 lines of 600 characters each. The entire operation takes about 5 GB of peak memory on my laptop.
julia> t=[randstring(600) for i=1:5*10^6];
julia> writecsv("/Users/aviks/tmp/long.csv", t)
julia> t=readstring("/Users/aviks/tmp/long.csv");
julia> length(t)
3005000000
julia> @time t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "");
43.599660 seconds (137 allocations: 3.358 GB, 0.85% gc time)
(PS: Note that readall is now deprecated in favour of readstring.)
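If holding the entire file in one string turns out to be the bottleneck, one alternative is to apply the same regex one line at a time and stream the result to a new file. The following is only a minimal sketch under stated assumptions, not the approach timed above: it uses Julia 1.x syntax (replace with a pattern => replacement pair), and the function name clean_file and the file names are hypothetical. It relies on the fact that this pattern never matches a newline, so cleaning line by line should give the same text as cleaning the whole string.

```julia
# Sketch for Julia 1.x. The regex is copied verbatim from the question.
const PATTERN = r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))"

# clean_file is a hypothetical helper: it reads `src` line by line,
# strips every match of PATTERN, and writes the cleaned line to `dst`.
function clean_file(src::AbstractString, dst::AbstractString)
    open(dst, "w") do out
        # eachline yields lines without their trailing newline,
        # so only one line is held in memory at a time.
        for line in eachline(src)
            println(out, replace(line, PATTERN => ""))
        end
    end
end
```

After something like clean_file("sourceFile", "cleaned.txt"), the cleaned file could be passed to readdlm (which lives in the DelimitedFiles standard library on Julia 1.x) instead of wrapping a multi-gigabyte string in an IOBuffer.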