Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Buffered golang channel losing data

I am trying to parse a huge Wiktionary dump using a goroutine, and am encountering a strange bug where the channel that the goroutine is reading from seems to be losing and corrupting data every time the channel blocks.

func main() {
    inFile, err := os.Open(*srcFile)
    if err != nil {
        log.LogErrorf("Error opening dump: %v", err)
        return
    }
    defer inFile.Close()

    var wg sync.WaitGroup
    input := make(chan []byte, 51)


    go func() {
        wg.Add(1)
        for line := range input {
            log.Printf("Bytes: %s", line)
            // process the line
        }
        wg.Done()
    }()

    scanner := bufio.NewScanner(inFile)
    count := 0
    for scanner.Scan() {
        count++
        log.Printf("Scanned: %d", count)
        if err := scanner.Err(); err != nil {
            log.LogErrorf("Error scanning: %v", err)
        }
        newestBytes := scanner.Bytes()
        log.Printf("Bytes: %s", newestBytes)
        input <- newestBytes
    }
    close(input)
    wg.Wait()
}

When I run this, I get the correct output. In particular, note lines 51 and 52.

2014/08/03 17:49:25 Scanned: 42
2014/08/03 17:49:25 Bytes:       <namespace key="115" case="case-sensitive">Citations talk</namespace>
2014/08/03 17:49:25 Scanned: 43
2014/08/03 17:49:25 Bytes:       <namespace key="116" case="case-sensitive">Sign gloss</namespace>
2014/08/03 17:49:25 Scanned: 44
2014/08/03 17:49:25 Bytes:       <namespace key="117" case="case-sensitive">Sign gloss talk</namespace>
2014/08/03 17:49:25 Scanned: 45
2014/08/03 17:49:25 Bytes:       <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:49:25 Scanned: 46
2014/08/03 17:49:25 Bytes:       <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:49:25 Scanned: 47
2014/08/03 17:49:25 Bytes:     </namespaces>
2014/08/03 17:49:25 Scanned: 48
2014/08/03 17:49:25 Bytes:   </siteinfo>
2014/08/03 17:49:25 Scanned: 49
2014/08/03 17:49:25 Bytes:   <page>
2014/08/03 17:49:25 Scanned: 50
2014/08/03 17:49:25 Bytes:     <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:49:25 Scanned: 51
2014/08/03 17:49:25 Bytes:     <ns>4</ns>
2014/08/03 17:49:25 Scanned: 52
2014/08/03 17:49:25 Bytes:     <id>6</id>
2014/08/03 17:49:25 Scanned: 53
2014/08/03 17:49:25 Bytes:     <restrictions>edit=autoconfirmed:move=sysop</restrictions>
2014/08/03 17:49:25 Scanned: 54
2014/08/03 17:49:25 Bytes:     <revision>
2014/08/03 17:49:25 Scanned: 55
2014/08/03 17:49:25 Bytes:       <id>24557508</id>
2014/08/03 17:49:25 Scanned: 56
2014/08/03 17:49:25 Bytes:       <parentid>19020708</parentid>
2014/08/03 17:49:25 Scanned: 57
2014/08/03 17:49:25 Bytes:       <timestamp>2013-12-30T13:50:49Z</timestamp>
2014/08/03 17:49:25 Scanned: 58
2014/08/03 17:49:25 Bytes:       <contributor>
2014/08/03 17:49:25 Scanned: 59

Yet when I print line instead (what the goroutine is receiving), I get the output below. After line 51, the channel blocks and main scans and passes 51 more values to the channel. However, the next line that the goroutine reads is incorrect, and more than that, it is clearly malformed.

Bytes:       <namespace key="828" case="case-sensitive">Module</namespace>
2014/08/03 17:40:52 Bytes:       <namespace key="829" case="case-sensitive">Module talk</namespace>
2014/08/03 17:40:52 Bytes:     </namespaces>
2014/08/03 17:40:52 Bytes:   </siteinfo>
2014/08/03 17:40:52 Bytes:   <page>
2014/08/03 17:40:52 Bytes:     <title>Wiktionary:Welcome, newcomers</title>
2014/08/03 17:40:52 Scanned: 52
2014/08/03 17:40:52 Scanned: 53
2014/08/03 17:40:52 Scanned: 54
2014/08/03 17:40:52 Scanned: 55
2014/08/03 17:40:52 Scanned: 56
2014/08/03 17:40:52 Scanned: 57
2014/08/03 17:40:52 Scanned: 58
2014/08/03 17:40:52 Scanned: 59
2014/08/03 17:40:52 Scanned: 60
2014/08/03 17:40:52 Scanned: 61
2014/08/03 17:40:52 Scanned: 62
2014/08/03 17:40:52 Scanned: 63
2014/08/03 17:40:52 Scanned: 64
2014/08/03 17:40:52 Scanned: 65
2014/08/03 17:40:52 Scanned: 66
2014/08/03 17:40:52 Scanned: 67
2014/08/03 17:40:52 Scanned: 68
2014/08/03 17:40:52 Scanned: 69
2014/08/03 17:40:52 Scanned: 70
2014/08/03 17:40:52 Scanned: 71
2014/08/03 17:40:52 Scanned: 72
2014/08/03 17:40:52 Scanned: 73
2014/08/03 17:40:52 Scanned: 74
2014/08/03 17:40:52 Scanned: 75
2014/08/03 17:40:52 Scanned: 76
2014/08/03 17:40:52 Scanned: 77
2014/08/03 17:40:52 Scanned: 78
2014/08/03 17:40:52 Scanned: 79
2014/08/03 17:40:52 Scanned: 80
2014/08/03 17:40:52 Scanned: 81
2014/08/03 17:40:52 Scanned: 82
2014/08/03 17:40:52 Scanned: 83
2014/08/03 17:40:52 Scanned: 84
2014/08/03 17:40:52 Scanned: 85
2014/08/03 17:40:52 Scanned: 86
2014/08/03 17:40:52 Scanned: 87
2014/08/03 17:40:52 Scanned: 88
2014/08/03 17:40:52 Scanned: 89
2014/08/03 17:40:52 Scanned: 90
2014/08/03 17:40:52 Scanned: 91
2014/08/03 17:40:52 Scanned: 92
2014/08/03 17:40:52 Scanned: 93
2014/08/03 17:40:52 Scanned: 94
2014/08/03 17:40:52 Scanned: 95
2014/08/03 17:40:52 Scanned: 96
2014/08/03 17:40:52 Scanned: 97
2014/08/03 17:40:52 Scanned: 98
2014/08/03 17:40:52 Scanned: 99
2014/08/03 17:40:52 Scanned: 100
2014/08/03 17:40:52 Scanned: 101
2014/08/03 17:40:52 Scanned: 102
2014/08/03 17:40:52 Bytes: nd other refer
2014/08/03 17:40:52 Bytes: nce and instru
2014/08/03 17:40:52 Bytes: tional materials. It stipulates that any copy of the material,
2014/08/03 17:40:52 Bytes: even if modifi
2014/08/03 17:40:52 Bytes: d, carry the same licen
2014/08/03 17:40:52 Bytes: e. Those copies may be sold but, if
2014/08/03 17:40:52 Bytes: produced in quantity, have to be made available i
2014/08/03 17:40:52 Bytes:  a format which fac
2014/08/03 17:40:52 Bytes: litates further editing. 

I have tried to reproduce this in the Go playground but I have been unsuccessful - it seems like this is something to do with the way slices are passed in channels.

like image 566
a3onstorm Avatar asked Feb 12 '23 08:02

a3onstorm


1 Answers

The function Scanner.Bytes may return the same slice used internally by the scanner.

func (s *Scanner) Bytes() []byte

Bytes returns the most recent token generated by a call to Scan. The underlying array may point to data that will be overwritten by a subsequent call to Scan. It does no allocation.

As per documentation, this slice may be overwritten by subsequent calls to Scanner.Scan. Since your code does not ensure that this slice is not used after the next call to Scanner.Scan (and in fact your code produces lines and consumes them asynchonously), it may contain garbage at the point where you're trying to use it.

Explicitly copy the slice to make sure that the data is not being overwritten by subsequent calls to Scanner.Scan.

input <- append(nil, newestBytes...)
like image 134
fuz Avatar answered Mar 07 '23 23:03

fuz