
replace() fails for large strings

Tags: julia

I have the following code:

cd(joinpath(homedir(),"Desktop"))

using HDF5
using JLD

# read contents of a file
t = readall("sourceFile")

# remove unnecessary characters
t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "")

# convert string into Float64 array (approximately ~140 columns)
data = readdlm(IOBuffer(t), ' ', char(10))

# save array on the hard drive
save("data.jld", "data", data)

This works fine when I test it with a sourceFile of 10^4 lines or fewer. However, when sourceFile has around 5*10^6 lines, it fails at t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "") with the following message:

errormsg

Bdar asked Sep 08 '15 21:09



1 Answer

This question is old and was asked against an older version of Julia, so it is worth checking whether the problem still occurs on a recent version. I tested the code above on the latest 0.5 release of Julia, and it works correctly with 5*10^6 lines of 600 characters each. The entire operation takes about 5 GB of peak memory on my laptop.

julia> t=[randstring(600) for i=1:5*10^6];

julia> writecsv("/Users/aviks/tmp/long.csv", t)

julia> t=readstring("/Users/aviks/tmp/long.csv");

julia> length(t)
3005000000

julia> @time t = replace(t, r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))", "");
  43.599660 seconds (137 allocations: 3.358 GB, 0.85% gc time)

(PS: Note that readall is now deprecated in favour of readstring).
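For readers on Julia 1.x, a rough equivalent might look like the sketch below. It is not part of the original answer. It assumes Julia 1.x, where `readstring` was in turn replaced by `read(source, String)`, `replace` takes a `pattern => replacement` pair, and `readdlm` lives in the `DelimitedFiles` standard library. The helper `clean_lines` and the sample input are hypothetical: the sample format is only guessed from the regex in the question. Processing line by line should also keep peak memory lower than calling `replace` on one multi-gigabyte string, though that has not been measured here.

```julia
using DelimitedFiles  # readdlm lives in this standard library on Julia 1.x

# Pattern copied verbatim from the question.
pattern = r"( 1:1\.0+)|(( 1:1\.0+)|(([1-6]:)|((\|user )|(\|))))"

# Hypothetical helper: strip the unwanted tokens one line at a time, so the
# whole input is never held as a single string alongside its modified copy.
function clean_lines(input::IO, output::IO, pattern::Regex)
    for line in eachline(input)  # eachline drops the trailing newline
        println(output, replace(line, pattern => ""))
    end
    return output
end

# Made-up sample in a format guessed from the regex; the real file may differ.
sample = IOBuffer("3 |user 1:0.5 2:1.25 1:1.000\n7 |user 1:2.0 2:3.5 1:1.0\n")
cleaned = String(take!(clean_lines(sample, IOBuffer(), pattern)))

# Parse the cleaned, space-delimited text into a Float64 matrix.
data = readdlm(IOBuffer(cleaned), ' ', Float64)
```

For a real file, the same helper could be given `open("sourceFile")` as input and a file opened for writing as output. The whole-file form used in the answer would be written `replace(read("sourceFile", String), pattern => "")` on Julia 1.x.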

aviks answered Nov 10 '22 00:11
