
Scanner terminating early

Tags:

go

I am trying to write a scanner in Go that handles continuation lines and cleans each line up before returning it, so that it returns logical lines. Given the following split function:

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "strings"
)

func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }

    i := bytes.IndexByte(data, '\n')
    for i > 0 && data[i-1] == '\\' {
        fmt.Printf("i: %d, data[i] = %q\n", i, data[i])
        i = i + bytes.IndexByte(data[i+1:], '\n')
    }

    var match []byte = nil
    advance := 0
    switch {
    case i >= 0:
        advance, match = i + 1, data[0:i]
    case atEOF: 
        advance, match = len(data), data
    }
    token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
    return advance, token, nil
}

func main() {
    simple := `
Just a test.

See what is returned. \
when you have empty lines.

Followed by a newline.
`

    scanner := bufio.NewScanner(strings.NewReader(simple))
    scanner.Split(ScanLogicalLines)
    for scanner.Scan() {
        fmt.Printf("line: %q\n", scanner.Text())
    }
}

I expected the code to return something like:

line: "Just a test."
line: ""
line: "See what is returned. when you have empty lines."
line: ""
line: "Followed by a newline."

However, it stops after returning the first line. The second call returns 1, "", nil.

Does anybody have any ideas, or is it a bug?

Mats Kindahl asked Nov 12 '13 20:11


1 Answer

I would regard this as a bug, because a return with advance > 0 is not meant to trigger a further read, even when the returned token is nil. The documentation for bufio.SplitFunc says:

If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.

What happens is this

The input buffer of the bufio.Scanner defaults to 4096 bytes. That means it reads up to that amount at once if it can, and then executes the split function. In your case the scanner can read your whole input at once, as it is well below 4096 bytes. This means that the next read results in EOF, which is the main problem here.

Step by step

  1. scanner.Scan reads all your data
  2. You get all the text that is there
  3. You look for a token and find the first line, which consists only of a newline
  4. You return nil as a token by removing the newline from the match
  5. scanner.Scan assumes: user needs more data
  6. scanner.Scan attempts to read more
  7. EOF happens
  8. scanner.Scan tries to tokenize one last time
  9. You find "Just a test."
  10. scanner.Scan tries to tokenize one last time
  11. You look for a token and find the third line, which consists only of a newline
  12. You return nil as a token by removing the newline from the match
  13. scanner.Scan sees the nil token and sets the error (EOF)
  14. Execution ends

How to circumvent

Any token that is non-nil will prevent this. As long as you return non-nil tokens, the scanner will not check for EOF and will keep calling your tokenizer.

Your code returns nil tokens because bytes.Replace returns nil when the input is empty and there is nothing to replace: it copies its input, and append([]byte(nil), nil...) == nil. You could prevent this by returning a slice with zero length but non-zero capacity, as this is non-nil: make([]byte, 0, 1) != nil.
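
One way to apply this is sketched below. It is not the only possible fix, and the names scanLogicalLines and logicalLines are hypothetical. The sketch builds every token in a slice created with make, so an empty logical line is returned as an empty, non-nil token. It also searches for the next newline starting after each continuation, which differs from the index arithmetic in the question's loop.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanLogicalLines is a sketch of a split function that joins lines ending
// in a backslash and returns a non-nil token for every consumed line.
func scanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}

	// Find the first newline that is not preceded by a backslash.
	end := -1
	for pos := 0; pos < len(data); {
		j := bytes.IndexByte(data[pos:], '\n')
		if j < 0 {
			break
		}
		nl := pos + j
		if nl == 0 || data[nl-1] != '\\' {
			end = nl
			break
		}
		pos = nl + 1 // continuation: keep searching after this newline
	}

	var advance int
	var match []byte
	switch {
	case end >= 0:
		advance, match = end+1, data[:end]
	case atEOF:
		advance, match = len(data), data
	default:
		return 0, nil, nil // no complete logical line yet: ask for more data
	}

	// make always yields a non-nil slice, even with length and capacity 0,
	// so the token stays non-nil when the logical line is empty.
	token := make([]byte, 0, len(match))
	token = append(token, bytes.Replace(match, []byte("\\\n"), nil, -1)...)
	return advance, token, nil
}

// logicalLines returns all logical lines found in input.
func logicalLines(input string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(scanLogicalLines)
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines
}

func main() {
	simple := "\nJust a test.\n\nSee what is returned. \\\nwhen you have empty lines.\n\nFollowed by a newline.\n"
	for _, line := range logicalLines(simple) {
		fmt.Printf("line: %q\n", line)
	}
}
```

With the question's input this should print six lines. The first is empty, because the raw string starts with a newline, and the continuation is joined into "See what is returned. when you have empty lines."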

nemo answered Oct 22 '22 07:10