I am trying to parse a huge Wiktionary dump using a goroutine, and am encountering a strange bug where the channel that the goroutine is reading from seems to be losing and corrupting data every time the channel blocks.
func main() {
	inFile, err := os.Open(*srcFile)
	if err != nil {
		log.LogErrorf("Error opening dump: %v", err)
		return
	}
	defer inFile.Close()
	var wg sync.WaitGroup
	input := make(chan []byte, 51)
	go func() {
		wg.Add(1)
		for line := range input {
			log.Printf("Bytes: %s", line)
			// process the line
		}
		wg.Done()
	}()
	scanner := bufio.NewScanner(inFile)
	count := 0
	for scanner.Scan() {
		count++
		log.Printf("Scanned: %d", count)
		if err := scanner.Err(); err != nil {
			log.LogErrorf("Error scanning: %v", err)
		}
		newestBytes := scanner.Bytes()
		log.Printf("Bytes: %s", newestBytes)
		input <- newestBytes
	}
	close(input)
	wg.Wait()
}
When I run this, I get the correct output. In particular, note lines 51 and 52.
2014/08/03 17:49:25 Scanned: 42
2014/08/03 17:49:25 Bytes: <namespace key="115" case="case-sensitive">Citations talk</namespace>
2014/08/03 17:49:25 Scanned: 43
2014/08/03 17:49:25 Bytes: <namespace key="116" case="case-sensitive">Sign gloss</namespace>
2014/08/03 17:49:25 Scanned: 44
2014/08/03 17:49:25 Bytes: <namespace key="117" case="case-sensitive">Sign gloss talk</namespace>
2014/08/03 17:49:25 Scanned: 45
2014/08/03 17:49:25 Bytes: <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:49:25 Scanned: 46
2014/08/03 17:49:25 Bytes: <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:49:25 Scanned: 47
2014/08/03 17:49:25 Bytes: </namespaces>
2014/08/03 17:49:25 Scanned: 48
2014/08/03 17:49:25 Bytes: </siteinfo>
2014/08/03 17:49:25 Scanned: 49
2014/08/03 17:49:25 Bytes: <page>
2014/08/03 17:49:25 Scanned: 50
2014/08/03 17:49:25 Bytes: <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:49:25 Scanned: 51
2014/08/03 17:49:25 Bytes: <ns>4</ns>
2014/08/03 17:49:25 Scanned: 52
2014/08/03 17:49:25 Bytes: <id>6</id>
2014/08/03 17:49:25 Scanned: 53
2014/08/03 17:49:25 Bytes: <restrictions>edit=autoconfirmed:move=sysop</restrictions>
2014/08/03 17:49:25 Scanned: 54
2014/08/03 17:49:25 Bytes: <revision>
2014/08/03 17:49:25 Scanned: 55
2014/08/03 17:49:25 Bytes: <id>24557508</id>
2014/08/03 17:49:25 Scanned: 56
2014/08/03 17:49:25 Bytes: <parentid>19020708</parentid>
2014/08/03 17:49:25 Scanned: 57
2014/08/03 17:49:25 Bytes: <timestamp>2013-12-30T13:50:49Z</timestamp>
2014/08/03 17:49:25 Scanned: 58
2014/08/03 17:49:25 Bytes: <contributor>
2014/08/03 17:49:25 Scanned: 59
Yet when I print line instead (what the goroutine is receiving), I get the output below. After line 51, the channel blocks and main scans and passes 51 more values to the channel. However, the next line that the goroutine reads is incorrect, and more than that, it is clearly malformed.
Bytes: <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:40:52 Bytes: <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:40:52 Bytes: </namespaces>
2014/08/03 17:40:52 Bytes: </siteinfo>
2014/08/03 17:40:52 Bytes: <page>
2014/08/03 17:40:52 Bytes: <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:40:52 Scanned: 52
2014/08/03 17:40:52 Scanned: 53
2014/08/03 17:40:52 Scanned: 54
2014/08/03 17:40:52 Scanned: 55
2014/08/03 17:40:52 Scanned: 56
2014/08/03 17:40:52 Scanned: 57
2014/08/03 17:40:52 Scanned: 58
2014/08/03 17:40:52 Scanned: 59
2014/08/03 17:40:52 Scanned: 60
2014/08/03 17:40:52 Scanned: 61
2014/08/03 17:40:52 Scanned: 62
2014/08/03 17:40:52 Scanned: 63
2014/08/03 17:40:52 Scanned: 64
2014/08/03 17:40:52 Scanned: 65
2014/08/03 17:40:52 Scanned: 66
2014/08/03 17:40:52 Scanned: 67
2014/08/03 17:40:52 Scanned: 68
2014/08/03 17:40:52 Scanned: 69
2014/08/03 17:40:52 Scanned: 70
2014/08/03 17:40:52 Scanned: 71
2014/08/03 17:40:52 Scanned: 72
2014/08/03 17:40:52 Scanned: 73
2014/08/03 17:40:52 Scanned: 74
2014/08/03 17:40:52 Scanned: 75
2014/08/03 17:40:52 Scanned: 76
2014/08/03 17:40:52 Scanned: 77
2014/08/03 17:40:52 Scanned: 78
2014/08/03 17:40:52 Scanned: 79
2014/08/03 17:40:52 Scanned: 80
2014/08/03 17:40:52 Scanned: 81
2014/08/03 17:40:52 Scanned: 82
2014/08/03 17:40:52 Scanned: 83
2014/08/03 17:40:52 Scanned: 84
2014/08/03 17:40:52 Scanned: 85
2014/08/03 17:40:52 Scanned: 86
2014/08/03 17:40:52 Scanned: 87
2014/08/03 17:40:52 Scanned: 88
2014/08/03 17:40:52 Scanned: 89
2014/08/03 17:40:52 Scanned: 90
2014/08/03 17:40:52 Scanned: 91
2014/08/03 17:40:52 Scanned: 92
2014/08/03 17:40:52 Scanned: 93
2014/08/03 17:40:52 Scanned: 94
2014/08/03 17:40:52 Scanned: 95
2014/08/03 17:40:52 Scanned: 96
2014/08/03 17:40:52 Scanned: 97
2014/08/03 17:40:52 Scanned: 98
2014/08/03 17:40:52 Scanned: 99
2014/08/03 17:40:52 Scanned: 100
2014/08/03 17:40:52 Scanned: 101
2014/08/03 17:40:52 Scanned: 102
2014/08/03 17:40:52 Bytes: nd other refer
2014/08/03 17:40:52 Bytes: nce and instru
2014/08/03 17:40:52 Bytes: tional materials. It stipulates that any copy of the material,
2014/08/03 17:40:52 Bytes: even if modifi
2014/08/03 17:40:52 Bytes: d, carry the same licen
2014/08/03 17:40:52 Bytes: e. Those copies may be sold but, if
2014/08/03 17:40:52 Bytes: produced in quantity, have to be made available i
2014/08/03 17:40:52 Bytes: a format which fac
2014/08/03 17:40:52 Bytes: litates further editing.
I have tried to reproduce this in the Go playground but I have been unsuccessful - it seems like this is something to do with the way slices are passed in channels.
The function Scanner.Bytes may return the same slice used internally by the scanner.
func (s *Scanner) Bytes() []byte
Bytes returns the most recent token generated by a call to Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.
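To see the aliasing in isolation, here is a minimal sketch. The 16-byte buffer set through Scanner.Buffer, the sample input, and the helper name retainedVsCopied are assumptions made for illustration; the small buffer only serves to make the scanner reuse its memory after a few lines. What the retained slice ends up holding depends on the scanner's internals, so treat that value as unspecified.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// retainedVsCopied scans five short lines through a deliberately tiny buffer.
// It keeps the first token twice: once as the raw slice returned by Bytes,
// and once as an explicit copy. (Hypothetical helper, for illustration only.)
func retainedVsCopied() (retained, copied string) {
	scanner := bufio.NewScanner(strings.NewReader("aaaa\nbbbb\ncccc\ndddd\neeee\n"))
	// Assumption: a small buffer so the scanner has to reuse it quickly.
	scanner.Buffer(make([]byte, 0, 16), 16)

	var raw, safe []byte
	for scanner.Scan() {
		if raw == nil {
			// Aliases the scanner's internal buffer.
			raw = scanner.Bytes()
			// Independent copy with its own backing array.
			safe = append([]byte(nil), scanner.Bytes()...)
		}
	}
	return string(raw), string(safe)
}

func main() {
	retained, copied := retainedVsCopied()
	fmt.Printf("retained: %q\n", retained) // may no longer be the first line
	fmt.Printf("copied:   %q\n", copied)   // still the first line
}
```

The copied value stays intact because it no longer shares memory with the scanner, while the retained slice is only valid until the next call to Scan.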
As the documentation says, this slice may be overwritten by subsequent calls to Scanner.Scan. Since your code does not ensure that the slice is no longer used after the next call to Scanner.Scan (in fact, it produces lines and consumes them asynchronously), the slice may contain garbage by the time you try to use it.
Explicitly copy the slice so that the data cannot be overwritten by subsequent calls to Scanner.Scan:
input <- append([]byte(nil), newestBytes...)