The Julia documentation describes pre-allocating memory for an Array
to improve performance by avoiding garbage collection. Is this possible to do with a String
which is a Vector, after all? String
source code is here.
My use case is that I am processing large text dumps, using readuntil()
to get a chunk, then doing regex match()
or matchall()
or replace()
. I already coded it in Perl, but want to see if Julia can be faster. I already know the length of the longest string I will have to process.
fs=open(fn,"r")
while !eof(fs)
text = readuntil(fs, "</tag>")
text = match(r"pattern"s, text).match
text = replace(text, r"badpattern", "goodpattern")
text = replace(text, r"anotherbadpattern", "betterpattern")
... (dozens more replacements)
end
close(fs)
I expect disk I/O to be the main bottleneck, but am interested in learning about anything that will help. I welcome any suggestions on possible methods to speed up the process.
Strings in Julia are immutable therefore the concept of pre-alocation does not work.
julia> a = "aaaa";
julia> pointer(a)
Ptr{UInt8} @0x0000000119628f50
julia> a = "bbbb";
julia> pointer(a)
Ptr{UInt8} @0x000000011963a030
In Julia 0.6 it is easier to attempt an even more optimized version. Specifically, we can have an IOBuffer and a String share the same memory. The replace
function writes to an IOBuffer and processes a String, so this gives us the opportunity to do replace!
without any allocations.
A series of replace!
statements can alternate between two pre-allocated buffers and (almost) avoid any allocation.
More concretely:
# version 0.6.0-pre.alpha.220
function replace!(out::IOBuffer, str::AbstractString,
pattern, repl, limit::Integer)
n = 1
e = endof(str)
i = a = start(str)
r = search(str,pattern,i)
j, k = first(r), last(r)
out.size = 0
out.ptr = 1
while j != 0
if i == a || i <= k
Base.unsafe_write(out, pointer(str, i), UInt(j-i))
Base._replace(out, repl, str, r, pattern)
end
if k<j
i = j
k = nextind(str, j)
else
i = k = nextind(str, k)
end
if j > e
break
end
r = search(str,pattern,k)
j, k = first(r), last(r)
n == limit && break
n += 1
end
write(out, SubString(str,i))
return out.ptr-1
end
And to drive this function and pre-allocate buffers we do:
const string1 = "."^1000
const string2 = "."^1000
const io1 = IOBuffer(convert(Vector{UInt8},string1),true,true,length(string1))
const io2 = IOBuffer(convert(Vector{UInt8},string2),true,true,length(string2))
write(io1,"hello world. "^30) # test string to manipulate
function switcheroo!()
replace!(io2,SubString(string1,1,io1.ptr-1),"hello","what's up",0)
replace!(io1,SubString(string2,1,io2.ptr-1),"what's up","hello",0)
end
switcheroo!()
The function switcheroo!
replaces hello
with whats up
and back again. It is simple to modify the code to handle a series of replace
as the question suggests is required.
Notes:
This is more closely coupled with Julia String internals and thus may not be stable going forward to future Julia versions (but just as efficient alternatives will probably remain).
There are still two (pesky!) allocations per replace!
for the two SubString values created. These allocations are 64 bytes and do not grow with string size.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With