I am trying to write a Go program that finds genes in very large files of DNA sequences. I already wrote a Perl program for this, but I would like to use goroutines to run the search in parallel ;)
Because the files are huge, my idea is to read 100 sequences at a time, hand them to a goroutine for analysis, then read the next 100 sequences, and so on.
I would like to thank the members of this site for their really helpful explanations of slices and goroutines.
I have made the suggested change so that the goroutines work on a copy of the slice. But running with -race still detects one data race, at the copy() call:
Thank you very much for your comments!
==================
WARNING: DATA RACE
Read by goroutine 6:
  runtime.slicecopy()
      /usr/lib/go-1.6/src/runtime/slice.go:113 +0x0
  main.main.func1()
      test_chan006.go:71 +0xd8
Previous write by main goroutine:
  main.main()
      test_chan006.go:63 +0x3b7
Goroutine 6 (running) created at:
  main.main()
      test_chan006.go:73 +0x4c9
==================
[>5HSAA098909 BA098909 ...]
Found 1 data race(s)
exit status 66
Line 71 is: copy(bufCopy, buf_Seq)
Line 63 is: buf_Seq = append(buf_Seq, line)
Line 73 is: }(genes, buf_Seq)
package main

import (
    "bufio"
    "fmt"
    "os"
    "sync"

    "github.com/mathpl/golang-pkg-pcre/src/pkg/pcre"
)

// read_genes reads a list of genes and returns a slice of gene names.
func read_genes(filename string) []string {
    var genes []string // slice of gene names
    // Open the file.
    f, _ := os.Open(filename)
    // Create a new Scanner for the file.
    scanner := bufio.NewScanner(f)
    // Loop over all lines in the file and collect them.
    for scanner.Scan() {
        line := scanner.Text()
        genes = append(genes, line)
    }
    return genes
}

// search_gene2 returns the sequences that match a gene in the genes slice.
func search_gene2(genes []string, seqs []string) []string {
    var res []string
    for r := 0; r <= len(seqs)-1; r++ {
        for i := 0; i <= len(genes)-1; i++ {
            match := pcre.MustCompile(genes[i], 0).MatcherString(seqs[r], 0)
            if match.Matches() {
                res = append(res, seqs[r]) // if a gene matches, the sequence is appended to res
                break
            }
        }
    }
    return res
}

//###########################################
func main() {
    var slice []string
    var buf_Seq []string
    read_buff := 100 // the number of sequences analysed by one goroutine
    var wg sync.WaitGroup
    queue := make(chan []string, 100)
    filename := "fasta/sequences.tsv"
    f, _ := os.Open(filename)
    scanner := bufio.NewScanner(f)
    n := 0
    genes := read_genes("lists/genes.csv")
    for scanner.Scan() {
        line := scanner.Text()
        n += 1
        buf_Seq = append(buf_Seq, line) // store the sequences into buf_Seq
        if n == read_buff { // when the read buffer contains 100 sequences one goroutine analyses them
            wg.Add(1)
            go func(genes, buf_Seq []string) {
                defer wg.Done()
                bufCopy := make([]string, len(buf_Seq))
                copy(bufCopy, buf_Seq)
                queue <- search_gene2(genes, bufCopy)
            }(genes, buf_Seq)
            buf_Seq = buf_Seq[:0] // reset buf_Seq
            n = 0                 // reset the sequences counter
        }
    }
    go func() {
        wg.Wait()
        close(queue)
    }()
    for t := range queue {
        slice = append(slice, t...)
    }
    fmt.Println(slice)
}
The goroutines are only working on copies of the slice headers; the underlying arrays are the same. To make a copy of a slice, you need to use copy (or append to a different slice).
buf_Seq = append(buf_Seq, line)
bufCopy := make([]string, len(buf_Seq))
copy(bufCopy, buf_Seq)
You can then safely pass bufCopy to the goroutines, or simply use it directly in the closure.
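The race report in the question shows the read happening in copy() inside the goroutine while the main goroutine is already appending to the same array again, so the copy itself has to run before the goroutine starts. Here is a minimal sketch of that ordering. The names processInBatches and searchBatch are hypothetical, and searchBatch uses a plain substring test as a stand-in for the PCRE matching so that the example runs with the standard library only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// searchBatch is a hypothetical stand-in for search_gene2: it returns the
// sequences that contain any of the gene names as a plain substring.
func searchBatch(genes, seqs []string) []string {
	var res []string
	for _, s := range seqs {
		for _, g := range genes {
			if strings.Contains(s, g) {
				res = append(res, s)
				break
			}
		}
	}
	return res
}

// processInBatches hands batches of batchSize lines to goroutines.
// Each batch is copied in the calling goroutine, before the worker starts,
// so no worker ever shares a backing array with buf.
func processInBatches(lines, genes []string, batchSize int) []string {
	var wg sync.WaitGroup
	queue := make(chan []string, 100)
	buf := make([]string, 0, batchSize)

	dispatch := func() {
		if len(buf) == 0 {
			return
		}
		batch := make([]string, len(buf))
		copy(batch, buf) // the copy happens here, not inside the goroutine
		wg.Add(1)
		go func(batch []string) {
			defer wg.Done()
			queue <- searchBatch(genes, batch)
		}(batch)
		buf = buf[:0] // safe to reuse: no goroutine references this array
	}

	for _, line := range lines {
		buf = append(buf, line)
		if len(buf) == batchSize {
			dispatch()
		}
	}
	dispatch() // flush the final, partial batch

	// Close the queue once every worker has sent its result.
	go func() {
		wg.Wait()
		close(queue)
	}()

	var found []string
	for res := range queue {
		found = append(found, res...)
	}
	sort.Strings(found) // workers finish in arbitrary order
	return found
}

func main() {
	lines := []string{"seq1 BRCA1", "seq2 none", "seq3 TP53", "seq4 none", "seq5 BRCA1"}
	fmt.Println(processInBatches(lines, []string{"BRCA1", "TP53"}, 2))
}
```

The sketch also flushes the last partial batch after the loop. In the question's program, any sequences left over when the file ends with fewer than 100 lines in the buffer are never handed to a goroutine.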
The slices are indeed copies, but slices themselves are reference types. A slice, fundamentally, is a 3-word structure. It contains a pointer to the start of an underlying array, an integer denoting the current number of elements in the slice, and another integer denoting the capacity of the underlying array. When you pass a slice into a function, a copy is made of this slice "header" structure, but the header still refers to the same underlying array as the header that was passed in.
This means any changes you make to the slice header itself, such as sub-slicing it or appending enough to trigger a resize (and thus a reallocation to a new location, with a new start pointer), will only be reflected in the slice header inside that function. Any changes to the underlying data itself, however, will be visible through the slice outside the function as well (unless you triggered a reallocation by growing the slice past its capacity).
Example: https://play.golang.org/p/a2y5eGulXW
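Separately from the linked playground snippet, the behaviour described above can be sketched in a few lines. The function names here are hypothetical and only illustrate the three cases: writing through a shared array, appending past capacity inside a function, and reusing a buffer with buf[:0] while another header still points at it (the situation in the question).

```go
package main

import "fmt"

// modifyElement writes through the backing array shared with the caller.
func modifyElement(s []string) {
	s[0] = "changed"
}

// growInside appends past capacity, so its local header points to a new
// array and later writes are invisible to the caller.
func growInside(s []string) {
	s = append(s, "extra") // reallocates because len(s) == cap(s)
	s[0] = "invisible"
}

// demo returns what the caller observes in each case.
func demo() (afterModify, afterGrow string, lenAfterGrow int, viewAfterReuse string) {
	a := []string{"x", "y"}
	modifyElement(a)
	afterModify = a[0] // the element write is visible to the caller

	b := []string{"x", "y"} // len 2, cap 2
	growInside(b)
	afterGrow, lenAfterGrow = b[0], len(b) // caller's header is untouched

	buf := []string{"first"}
	view := buf                 // second header, same backing array
	buf = buf[:0]               // "reset" keeps the same array
	buf = append(buf, "second") // overwrites index 0 in place
	viewAfterReuse = view[0]    // view sees the overwritten element
	return
}

func main() {
	fmt.Println(demo())
}
```

The last case is why resetting a buffer with buf[:0] is only safe once nothing else still reads from the old contents.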