I am trying to write a scanner in Go that handles continuation lines and also cleans each line up before returning it, so that it returns logical lines. So, given the following split function:
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func ScanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	// Look for a newline that is not escaped by a preceding backslash.
	i := bytes.IndexByte(data, '\n')
	for i > 0 && data[i-1] == '\\' {
		fmt.Printf("i: %d, data[i] = %q\n", i, data[i])
		i = i + bytes.IndexByte(data[i+1:], '\n')
	}
	var match []byte = nil
	advance := 0
	switch {
	case i >= 0:
		advance, match = i+1, data[0:i]
	case atEOF:
		advance, match = len(data), data
	}
	// Strip the backslash-newline pairs to form the logical line.
	token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
	return advance, token, nil
}

func main() {
	simple := `
Just a test.

See what is returned. \
when you have empty lines.

Followed by a newline.
`
	scanner := bufio.NewScanner(strings.NewReader(simple))
	scanner.Split(ScanLogicalLines)
	for scanner.Scan() {
		fmt.Printf("line: %q\n", scanner.Text())
	}
}
I expected the code to return something like:
line: "Just a test."
line: ""
line: "See what is returned. when you have empty lines."
line: ""
line: "Followed by a newline."
However, it stops after returning the first line. The second call returns 1, "", nil.
Does anybody have any ideas, or is it a bug?
I would regard this as a bug, because returning an advance value > 0 is not meant to trigger a further read, even when the returned token is nil (see bufio.SplitFunc):
If the data does not yet hold a complete token, for instance if it has no newline while scanning lines, SplitFunc can return (0, nil) to signal the Scanner to read more data into the slice and try again with a longer slice starting at the same point in the input.
The input buffer of the bufio.Scanner defaults to 4096 bytes. That means that it reads up to this amount at once if it can and then executes the split function. In your case the scanner can read your input all at once, as it is well below 4096 bytes. This means that the next read it does results in EOF, which is the main problem here.
1. scanner.Scan reads all your data.
2. Your split function returns nil as a token by removing the newline from the match.
3. scanner.Scan assumes: user needs more data.
4. scanner.Scan attempts to read more.
5. EOF happens.
6. scanner.Scan tries to tokenize one last time.
7. Your split function returns "Just a test."
8. scanner.Scan tries to tokenize one last time.
9. Your split function returns nil as a token by removing the newline from the match.
10. scanner.Scan sees the nil token and sets the error (EOF).

Any token that is non-nil will prevent this. As long as you return non-nil tokens, the scanner will not check for EOF and continues executing your tokenizer.
The reason why your code returns nil tokens is that bytes.Replace returns nil when the match is empty and there is nothing to replace, because append([]byte(nil), nil...) == nil.
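A quick way to check this yourself is the sketch below. The helper name replaceIsNil is hypothetical, and the nil result relies on how bytes.Replace currently builds its copy, which as far as I know is not a documented guarantee.

```go
package main

import (
	"bytes"
	"fmt"
)

// replaceIsNil reports whether stripping backslash-newline pairs from match
// with bytes.Replace produces a nil slice.
func replaceIsNil(match []byte) bool {
	token := bytes.Replace(match, []byte("\\\n"), []byte(""), -1)
	return token == nil
}

func main() {
	data := []byte("\nJust a test.\n")

	blank := data[0:0] // empty but non-nil, like the match for a blank line
	fmt.Println("blank line match is nil:", blank == nil)
	fmt.Println("token for blank line is nil:", replaceIsNil(blank))

	line := data[1:13] // "Just a test."
	fmt.Println("token for non-empty line is nil:", replaceIsNil(line))
}
```

So the slice passed in for a blank line is not nil; it only becomes nil on the way through bytes.Replace. Non-empty lines come back as non-nil copies.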
You could prevent this by returning a slice with a capacity and no elements, as this would be non-nil: make([]byte, 0, 1) != nil.
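Putting that together, a split function along these lines might work. This is only a sketch, not a drop-in fix: scanLogicalLines and logicalLines are names I picked for the example. Besides always returning a non-nil token, it differs from the code in the question in two ways: the index of the next newline is computed as i + 1 + j, and it asks for more data when a continuation is not yet terminated.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// scanLogicalLines joins lines that end in a backslash and always returns a
// non-nil token once a complete logical line is available.
func scanLogicalLines(data []byte, atEOF bool) (int, []byte, error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	// Find the first newline that is not preceded by a backslash.
	i := bytes.IndexByte(data, '\n')
	for i > 0 && data[i-1] == '\\' {
		j := bytes.IndexByte(data[i+1:], '\n')
		if j < 0 {
			i = -1 // the continuation is not terminated in this buffer
			break
		}
		i += 1 + j
	}
	var advance int
	var match []byte
	switch {
	case i >= 0:
		advance, match = i+1, data[:i]
	case atEOF:
		advance, match = len(data), data
	default:
		return 0, nil, nil // no complete logical line yet: request more data
	}
	// Start from an empty, non-nil slice so blank lines yield "" and not nil.
	token := make([]byte, 0, len(match)+1)
	token = append(token, bytes.Replace(match, []byte("\\\n"), nil, -1)...)
	return advance, token, nil
}

// logicalLines collects every token produced for the given input.
func logicalLines(input string) []string {
	scanner := bufio.NewScanner(strings.NewReader(input))
	scanner.Split(scanLogicalLines)
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines
}

func main() {
	input := "\nJust a test.\n\nSee what is returned. \\\nwhen you have empty lines.\n\nFollowed by a newline.\n"
	for _, line := range logicalLines(input) {
		fmt.Printf("line: %q\n", line)
	}
}
```

Note that the input in the question starts with a newline, so the first line reported is an empty one.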