Background
I have big (2 GiB < myfile < 10 GiB) JSON files that I need to parse. Due to the size of the files, I cannot read one into a variable and unmarshal it all at once.
This is why I am trying to use json.NewDecoder as shown in the example here.
A bit about data
The data I have is roughly like the following:
{
  "key1" : [ "hundreds_of_nested_objects" ],
  "key2" : [ "hundreds_of_nested_objects" ],
  "unknown_unexpected_key" : [ "many_nested_objects" ],
  ...
  "keyN" : [ n_objects ]
}
The Code I am trying to use
file, err := os.Open("myfile.json")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

dec := json.NewDecoder(file)
for {
    t, err := dec.Token()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%T: %v\n", t, t)
}
Problem Statement
How can I parse specific parts of this data using json.NewDecoder? To clarify, a few use cases could be the following:

Parse only key1 or keyN instead of the whole file.

[N.B.] I am new to development, so my question might be too broad. Any guidance to improve it would be helpful too.
For starters, when you use json.Unmarshal you provide it all the bytes of the JSON input, so you need to read the entire source file before you can even start to allocate memory for your Go representation of the data.
I hardly ever use json.Unmarshal. Use json.NewDecoder instead, as in the sketch below, which will stream the data into the unmarshaler bit by bit.
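Here's a minimal sketch of that approach, assuming the top-level shape from the question (an object whose values are large arrays). It walks the top-level tokens and decodes one array element at a time, so only the current element is held in memory; json.RawMessage here is just a placeholder for whatever per-element type you actually need.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
)

func main() {
    file, err := os.Open("myfile.json")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    dec := json.NewDecoder(file)

    // Consume the opening '{' of the top-level object.
    if _, err := dec.Token(); err != nil {
        log.Fatal(err)
    }

    // Iterate over "key": [ ... ] pairs until the closing '}'.
    for dec.More() {
        keyTok, err := dec.Token() // the key, e.g. "key1"
        if err != nil {
            log.Fatal(err)
        }
        key := keyTok.(string)

        // Consume the '[' that opens this key's array.
        if _, err := dec.Token(); err != nil {
            log.Fatal(err)
        }

        // Decode one array element at a time; only the current
        // element is in memory at any point.
        for dec.More() {
            var elem json.RawMessage
            if err := dec.Decode(&elem); err != nil {
                log.Fatal(err)
            }
            fmt.Printf("%s: element of %d bytes\n", key, len(elem))
        }

        // Consume the ']' that closes the array.
        if _, err := dec.Token(); err != nil {
            log.Fatal(err)
        }
    }
}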
You'd still have to fit all of the data representation in memory (at least the parts you modeled), but depending on the data in the JSON, that might be quite a bit less memory than the JSON representation itself.
For example, in JSON, numbers are represented as strings of digit characters, but they can often fit into much smaller int or float64 types. Booleans and nulls are also much bigger in JSON than in their Go representation. The structural characters []{}: probably won't require as much space in Go's in-memory types. And of course any whitespace in the JSON does nothing but make the file larger. (I'd recommend minifying the JSON to remove unnecessary whitespace, but that's only relevant for storage and won't have much effect once you're streaming the JSON data.)
Decode into structs and omit as much data from your model as possible
If there's a lot of JSON data that isn't relevant to your operation, omit it from your models.
You can't do this if you're letting the decoder decode into generic types like map[string]interface{}. But if you use the struct-based decoding mechanisms, you can specify which fields you want to store, which might significantly decrease the size of your representation in memory. Combined with the streaming decoder, that might solve your memory constraint. See the sketch below.
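For instance, a hypothetical sketch: suppose each nested object carries many fields but your task only needs an id and a name (both field names are assumptions, not from the question). Decoding into a struct like this parses and discards every other field instead of storing it:

// Only the fields listed here are kept; everything else in each
// JSON object is parsed and dropped.
type Item struct {
    ID   int64  `json:"id"`
    Name string `json:"name"`
}

// Inside the element loop of the streaming sketch above:
var item Item
if err := dec.Decode(&item); err != nil {
    log.Fatal(err)
}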
Clearly some of your data has unknown keys, so you can't store all of it in a structure. If you can't sufficiently define the structures that match your data, this option's off the table.
Memory is cheap these days, and disk space for swap is even cheaper. If you can get away with adding either to alleviate your memory constraint, so you can fit all the data's representation in memory, that's by far the simplest solution to your problem.
JSON is a great format and one of my favorites. But it's not very good for storing large volumes of data because it's cumbersome to access subsets of that data at a time.
Formats like Parquet store data in a way that makes the underlying storage much more efficient to query and navigate. This means that instead of reading all the data and then representing it all in memory, you can read the parts of the data you want directly from the on-disk storage.
The same would be true of a well indexed SQL or NoSQL database.
You could even break your data down into multiple JSON files that you could read and process sequentially, as in the sketch below.
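For example, a one-time splitter could stream the big file once and write each top-level key's array to its own file, so later runs only open the file they need. This is a sketch under two assumptions: the keys are safe to use as file names, and one key's array fits in memory at a time.

// splitByKey streams path and writes each top-level key's array
// to <key>.json. It buffers one key's array at a time, never the
// whole file.
func splitByKey(path string) error {
    in, err := os.Open(path)
    if err != nil {
        return err
    }
    defer in.Close()

    dec := json.NewDecoder(in)
    if _, err := dec.Token(); err != nil { // opening '{'
        return err
    }
    for dec.More() {
        keyTok, err := dec.Token()
        if err != nil {
            return err
        }
        key := keyTok.(string)

        // Decode this key's entire array, then write it out.
        var arr []json.RawMessage
        if err := dec.Decode(&arr); err != nil {
            return err
        }
        data, err := json.Marshal(arr)
        if err != nil {
            return err
        }
        if err := os.WriteFile(key+".json", data, 0o644); err != nil {
            return err
        }
    }
    return nil
}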
If you really can't (or won't) add memory or swap space, change the format of the data, or break it into smaller parts, then you have to write a JSON scanner that keeps track of your location in the JSON file so that you know how to process the data. You won't have a representation of all the data at once, but you might be able to pick out the pieces you need without storing everything at once.
It's complicated, though, and would be specific to the task in question; there's no generic answer for how to do it.
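That said, for the flat top-level shape in this question, a small token scanner is feasible. Here's a hedged sketch (findKey and skipValue are made-up names, not library functions): it skips unwanted values token by token, tracking nesting depth, and decodes only the one key you ask for. The value you do decode still has to fit in memory, which matches the trade-off described above.

// findKey scans the top-level object and decodes only wantKey's
// value into out, consuming every other value without storing it.
func findKey(dec *json.Decoder, wantKey string, out interface{}) error {
    if _, err := dec.Token(); err != nil { // opening '{'
        return err
    }
    for dec.More() {
        keyTok, err := dec.Token()
        if err != nil {
            return err
        }
        if keyTok.(string) == wantKey {
            // Decode just this key's value into out.
            return dec.Decode(out)
        }
        // Not the key we want: skip its value entirely.
        if err := skipValue(dec); err != nil {
            return err
        }
    }
    return fmt.Errorf("key %q not found", wantKey)
}

// skipValue consumes exactly one JSON value, tracking nesting depth
// so nested objects and arrays are skipped in full.
func skipValue(dec *json.Decoder) error {
    depth := 0
    for {
        tok, err := dec.Token()
        if err != nil {
            return err
        }
        switch tok {
        case json.Delim('{'), json.Delim('['):
            depth++
        case json.Delim('}'), json.Delim(']'):
            depth--
        }
        if depth == 0 {
            return nil
        }
    }
}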