Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can you preallocate space for a String in Julia?

The Julia documentation describes pre-allocating memory for an Array to improve performance by avoiding garbage collection. Is this possible to do with a String which is a Vector, after all? String source code is here.

My use case is that I am processing large text dumps, using readuntil() to get a chunk, then doing regex match() or matchall() or replace(). I already coded it in Perl, but want to see if Julia can be faster. I already know the length of the longest string I will have to process.

fs=open(fn,"r")
    while !eof(fs)
        text = readuntil(fs, "</tag>")
        text = match(r"pattern"s, text).match
        text = replace(text, r"badpattern", "goodpattern")
        text = replace(text, r"anotherbadpattern", "betterpattern")
        ... (dozens more replacements)
    end
close(fs)

I expect disk I/O to be the main bottleneck, but am interested in learning about anything that will help. I welcome any suggestions on possible methods to speed up the process.

like image 321
ultradian Avatar asked Mar 19 '17 20:03

ultradian


2 Answers

Strings in Julia are immutable therefore the concept of pre-alocation does not work.

julia> a = "aaaa";
julia> pointer(a)
   Ptr{UInt8} @0x0000000119628f50

julia> a = "bbbb";
julia> pointer(a)
   Ptr{UInt8} @0x000000011963a030
like image 185
slowbrain Avatar answered Oct 23 '22 17:10

slowbrain


In Julia 0.6 it is easier to attempt an even more optimized version. Specifically, we can have an IOBuffer and a String share the same memory. The replace function writes to an IOBuffer and processes a String, so this gives us the opportunity to do replace! without any allocations.

A series of replace! statements can alternate between two pre-allocated buffers and (almost) avoid any allocation.

More concretely:

# version 0.6.0-pre.alpha.220
function replace!(out::IOBuffer, str::AbstractString, 
  pattern, repl, limit::Integer)
    n = 1
    e = endof(str)
    i = a = start(str)
    r = search(str,pattern,i)
    j, k = first(r), last(r)
    out.size = 0
    out.ptr = 1
    while j != 0
        if i == a || i <= k
            Base.unsafe_write(out, pointer(str, i), UInt(j-i))
            Base._replace(out, repl, str, r, pattern)
        end
        if k<j
            i = j
            k = nextind(str, j)
        else
            i = k = nextind(str, k)
        end
        if j > e
            break
        end
        r = search(str,pattern,k)
        j, k = first(r), last(r)
        n == limit && break
        n += 1
    end
    write(out, SubString(str,i))
  return out.ptr-1
end

And to drive this function and pre-allocate buffers we do:

const string1 = "."^1000
const string2 = "."^1000

const io1 = IOBuffer(convert(Vector{UInt8},string1),true,true,length(string1))
const io2 = IOBuffer(convert(Vector{UInt8},string2),true,true,length(string2))

write(io1,"hello world. "^30)   # test string to manipulate

function switcheroo!()
    replace!(io2,SubString(string1,1,io1.ptr-1),"hello","what's up",0)
    replace!(io1,SubString(string2,1,io2.ptr-1),"what's up","hello",0)
end

switcheroo!()

The function switcheroo! replaces hello with whats up and back again. It is simple to modify the code to handle a series of replace as the question suggests is required.

Notes:

  1. This is more closely coupled with Julia String internals and thus may not be stable going forward to future Julia versions (but just as efficient alternatives will probably remain).

  2. There are still two (pesky!) allocations per replace! for the two SubString values created. These allocations are 64 bytes and do not grow with string size.

like image 3
Dan Getz Avatar answered Oct 23 '22 17:10

Dan Getz