
Efficient appending to a variable-length container of strings (Golang)

The problem:

I need to apply multiple regexes to each line of a big log file (several GB long), gather non-empty matches, and put them all in an array (for serialization and sending over the network).

Slices are not much help if the answer to this question holds:

If the slice does not have sufficient capacity, append will need to allocate new memory and copy the old one over. For slices with <1024 elements, it will double the capacity; for slices with >1024 elements it will increase it by a factor of 1.25.

Since there can be literally hundreds of thousands of regex matches, I can't really predict the length/capacity of the slice. I can't make it too big "just in case" either, because that would waste memory (or would it? If the memory allocator is smart enough not to commit memory that is never written to, maybe I could use a huge slice capacity without much harm?).
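
A tiny illustrative sketch (my own, not from the quoted answer) that prints every reallocation as append grows a slice, making the growth policy visible:

package main

import "fmt"

func main() {
    var s []string
    oldCap := cap(s)
    for i := 0; i < 5000; i++ {
        s = append(s, "x")
        if c := cap(s); c != oldCap {
            // a reallocation happened: capacity jumped
            fmt.Printf("len=%d cap=%d (was %d)\n", len(s), c, oldCap)
            oldCap = c
        }
    }
}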

So I'm thinking about the following alternative (a rough sketch follows the list):

  1. have a doubly-linked list of matches (http://golang.org/pkg/container/list/)
  2. calculate its length (will len() work?)
  3. preallocate a slice of this capacity
  4. copy string pointers to this slice
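
A minimal sketch of those four steps (note: container/list has a Len() method; the built-in len() does not work on it):

package main

import (
    "container/list"
    "fmt"
)

func main() {
    // 1. collect matches in a doubly-linked list
    l := list.New()
    l.PushBack("match-1")
    l.PushBack("match-2")

    // 2. its length comes from the Len() method, not the built-in len()
    // 3. preallocate a slice with exactly that capacity
    out := make([]string, 0, l.Len())

    // 4. copy the strings over
    for e := l.Front(); e != nil; e = e.Next() {
        out = append(out, e.Value.(string))
    }
    fmt.Println(out)
}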

Is there a less laborious way of achieving this goal in Go (appending with ~O(1) amortized complexity)?

(golang newbie here, of course)

asked Nov 27 '13 by LetMeSOThat4U

2 Answers

append()'s average (amortized) cost is already O(1) because it grows the array by a percentage each time. As the array gets larger, growing it gets costlier but proportionately rarer. A 10M-item slice will be 10x more expensive to grow than a 1M-item slice, but since the extra capacity we're allocating is proportional to the size, it'll also be 10x as many append(slice, item) calls until the next time it grows. The increasing cost and decreasing frequency of reallocations cancel out, leaving the average cost constant, i.e., O(1).

The same idea applies to other languages' dynamically-sized arrays, too: Microsoft's std::vector implementation apparently grows the array by 50% each time, for example. Amortized O(1) doesn't mean you pay nothing for allocations, only that you continue paying at the same average rate as your array gets bigger.

On my laptop, I could run a million slice = append(slice, someStaticString) operations in 77ms. One reason it's quick, which siritinga noted below, is that "copying" the string to enlarge the array is really just copying the string header (a pointer/length pair), not copying the contents. 100,000 string headers is still under 2MB to copy, not a big deal compared to the other quantities of data you're working with.

container/list turned out ~3x slower for me in a microbenchmark; linked-list append is also constant time, of course, but I imagine append has a lower constant factor because it can usually just write to a couple of words of memory without allocating a new list item. The timing code won't work in the Playground, but you can copy it locally and run it to see for yourself: http://play.golang.org/p/uYyMScmOjX

Sometimes you can usefully pre-allocate space to avoid reallocations/copies (in this example, using make([]string, 0, 1000000) takes the runtime from ~77ms to ~10ms), but often you just don't have enough information about the expected data size to eke out worthwhile gains, and you're better off leaving it to the built-in algorithm.
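
For reference, a minimal version of that comparison (not the exact code behind the Playground link; timings will vary by machine):

package main

import (
    "fmt"
    "time"
)

func fill(s []string, n int) []string {
    for i := 0; i < n; i++ {
        s = append(s, "some static string")
    }
    return s
}

func main() {
    const n = 1000000

    start := time.Now()
    _ = fill(nil, n) // capacity grows as needed
    fmt.Println("grown:       ", time.Since(start))

    start = time.Now()
    _ = fill(make([]string, 0, n), n) // capacity reserved up front
    fmt.Println("preallocated:", time.Since(start))
}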


But you're asking a more specific question here, about a grep-like application (and thanks for asking a detailed question with context). For that, the bottom-line recommendation is that if you're searching gigabytes of logs, it's probably best to avoid buffering the whole output in RAM at all.

You could write something to stream the results as a single function: logparser.Grep(in io.Reader, out io.Writer, patterns []regexp.Regexp); you could alternatively make out a chan []byte or func(match []byte) (err error) if you don't want the code that sends results to be too enmeshed with the grep code.
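
A sketch of one possible shape for that streaming function (illustrative only, not code from this answer; it also takes []*regexp.Regexp rather than []regexp.Regexp, since the regexp methods have pointer receivers):

package logparser

import (
    "bufio"
    "io"
    "regexp"
)

// Grep writes each match to out as soon as it is found, so the full
// result set never has to sit in memory at once.
func Grep(in io.Reader, out io.Writer, patterns []*regexp.Regexp) error {
    scanner := bufio.NewScanner(in)
    for scanner.Scan() {
        line := scanner.Bytes()
        for _, p := range patterns {
            for _, m := range p.FindAll(line, -1) {
                // write the match and a separator; don't append to m,
                // since it aliases the scanner's buffer
                if _, err := out.Write(m); err != nil {
                    return err
                }
                if _, err := out.Write([]byte{'\n'}); err != nil {
                    return err
                }
            }
        }
    }
    return scanner.Err()
}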

(On []byte vs. string: a []byte seems to do the job here and avoids []byte<=>string conversions when you do I/O, so I'd prefer that. I don't know what all you're doing, though, and if you need string it's fine.)

If you do keep the whole match list in RAM, be aware that keeping around a reference to part of a big string or byte slice keeps the whole source string/slice from being garbage collected. So if you go that route, then counterintuitively you may actually want to copy matches to avoid keeping all of the source log data in RAM.
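
A minimal illustration of that copy; the second answer below uses the same append([]byte(nil), m...) idiom:

package main

import "fmt"

func main() {
    line := []byte("2013-11-27 ERROR something went wrong")
    m := line[11:16] // a "match" that aliases the whole line's backing array

    // copying just the match lets the rest of line be garbage collected
    saved := append([]byte(nil), m...)
    fmt.Println(string(saved)) // ERROR
}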

answered by twotwotwo

I tried to distill your question into a very simple example.

Since there can be "hundreds of thousands of regex matches", I did a large initial allocation of 1 M (1024 * 1024) entries for the matches slice capacity. A slice is a reference type. A slice header struct holds a length, a capacity, and a pointer, for a total of 24 (3 * 8) bytes on a 64-bit OS. The initial allocation for a slice of 1 M entries is therefore only 24 MB (1 M entries * 24 bytes). If there are ever more than 1 M entries, a new slice with a capacity of 1.25 M (1 M * 1.25) entries will be allocated, and the existing 1 M slice headers (24 MB) will be copied to it.
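
Those numbers are easy to verify directly; a small sketch of mine, not from the answer:

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var s []byte
    fmt.Println(unsafe.Sizeof(s)) // 24 bytes per slice header on 64-bit
    m := make([][]byte, 0, 1024*1024)
    fmt.Println(cap(m) * int(unsafe.Sizeof(s))) // 25165824 bytes, i.e. 24 MB of headers
}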

In summary, you can avoid much of the overhead of many appends by initially over-allocating the slice capacity. The bigger memory problem is all the data that is saved and referenced for each match. The far bigger CPU time problem is the time taken by the regexp FindAll calls.

package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
)

var searches = []*regexp.Regexp{
    regexp.MustCompile("configure"),
    regexp.MustCompile("unknown"),
    regexp.MustCompile("PATH"),
}

var matches = make([][]byte, 0, 1024*1024)

func main() {
    logName := "config.log"
    log, err := os.Open(logName)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer log.Close()
    scanner := bufio.NewScanner(log)
    for scanner.Scan() {
        // line is only valid until the next call to Scan
        line := scanner.Bytes()
        for _, s := range searches {
            for _, m := range s.FindAll(line, -1) {
                // copy m: it aliases the scanner's buffer, which is
                // overwritten on the next Scan
                matches = append(matches, append([]byte(nil), m...))
            }
        }
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
    // Output the match count and the first few matches
    fmt.Println(len(matches))
    for i, m := range matches {
        fmt.Println(string(m))
        if i >= 16 {
            break
        }
    }
}

answered by peterSO