I have a file with around 85 million JSON records; the file is around 110 GB. I want to read it in batches of 1 million records, in sequence. I am reading the file line by line with a bufio.Scanner and appending lines until I have 1 million of them. Here is the gist of what I am doing:
var rawBatch []string
batchSize := 1000000
file, err := os.Open(filePath)
if err != nil {
    // error handling
}
defer file.Close()
scanner := bufio.NewScanner(file)
for scanner.Scan() {
    // Copy the token: the slice returned by scanner.Bytes()
    // is only valid until the next call to Scan.
    rec := string(scanner.Bytes())
    rawBatch = append(rawBatch, rec)
    if len(rawBatch) == batchSize {
        for i := 0; i < batchSize; i++ {
            var tRec parsers.TRecord
            // json.Unmarshal takes []byte, so convert the stored string.
            err := json.Unmarshal([]byte(rawBatch[i]), &tRec)
            if err != nil {
                // Error thrown here
            }
        }
        // process
        rawBatch = rawBatch[:0] // reset; reuses the backing array for the next batch
    }
}
// Note: a trailing partial batch (fewer than batchSize records) is still
// sitting in rawBatch here and needs the same processing.
if err := scanner.Err(); err != nil {
    // handle read error (e.g. bufio.ErrTooLong)
}
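One thing worth knowing about bufio.Scanner on a file like this: by default it rejects any line longer than 64 KB (bufio.MaxScanTokenSize) and simply stops scanning, which is why the scanner.Err() check above matters. If any of your records can exceed that limit, grow the buffer before scanning. A minimal sketch (the 1 MB cap is an assumption, not a value from the original code):

buf := make([]byte, 0, 1024*1024)
scanner.Buffer(buf, 1024*1024) // allow lines up to 1 MB instead of the 64 KB default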
The target struct and a sample of a correct record:
type TRecord struct {
    Key1 string `json:"key1"`
    Key2 string `json:"key2"` // fields must be exported, or encoding/json silently ignores them
}
{"key1":"15","key2":"21"}
The issue I am facing is that while reading, some of these records come back corrupted: for example, a colon changed to a semicolon, or a double quote changed to a #. I am getting this error:
Unable to load Record: Unable to load record in:
{"key1":#15","key2":"21"}
invalid character '#' looking for beginning of value
Any ideas on why this is happening? And how can I fix it?
Details:
As suggested by @icza in the comments:
That occasional, very rare 1 bit change suggests hardware failure (memory, processor cache, hard disk). I do recommend to test it on another computer.
I tested my code on some other machines, and it runs perfectly fine there. It looks like this occasional, very rare bit flip, caused by a hardware failure on the original machine, was the source of the issue.
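If you want to rule out on-disk corruption versus in-memory bit flips yourself, one approach (my suggestion, not from the original discussion) is to hash the file on each machine and compare the digests; two reads of the same file producing different hashes on one machine also point at failing hardware:

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    // "records.json" is a placeholder; point it at the actual input file.
    f, err := os.Open("records.json")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%x\n", h.Sum(nil)) // compare this digest across machines/runs
}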