I have two JSON files: data_large (150.1 MB) and data_small (7.5 KB). The content of each file looks like [{"score": 68},{"score": 78}]. I need to find the list of unique scores from each file.
While dealing with data_small, I did the following and was able to view its content in 0.1 seconds.
import json

with open('data_small') as f:
    content = json.load(f)
print(content)  # I'll apply the logic to find the unique values later.
But while dealing with data_large, I did the same and my system hung and became so slow that I had to force-shut it down to restore its normal speed. It took around 2 minutes to print its content.
import json

with open('data_large') as f:
    content = json.load(f)
print(content)  # I'll apply the logic to find the unique values later.
How can I make the program more efficient when dealing with large data sets?
Comparing JSON values is quite simple: we can use the == operator. Note that == and is are not the same. The == operator checks equality of values, whereas is checks reference (identity) equality. Use == here; is will not give the expected result.
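A quick illustration of the difference, using two separately parsed documents with identical content:

```python
import json

# Two separately parsed documents with the same content:
a = json.loads('{"score": 68}')
b = json.loads('{"score": 68}')

print(a == b)   # True  -- the values are equal
print(a is b)   # False -- they are two distinct objects in memory
```

Because json.loads builds a fresh dict each time, identity comparison with is fails even though the data is identical.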
How large can JSON documents be? One of the more frequently asked questions about the native JSON data type is what size a JSON document can be. The short answer is that the maximum size is 1 GB. However, JSON often changes how data modeling is done, and that deserves a slightly longer response.
It's clear that loading the whole JSON file at once is wasteful; with a large enough file, it would be impossible to load at all. Given a JSON file that's structured as a list of objects, we could in principle parse it one chunk at a time instead of all at once.
Since your JSON file is not that large and you can afford to load it into RAM all at once, you can get all the unique values like this:
import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing 150 MB to stdout is very slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than
#     values = set([item['score'] for item in content])
# which builds an intermediate list of all values first.

# It's faster to save the results to a file than to print them.
with open('results.json', 'w') as fid:
    # json can't serialize sets, hence the conversion to a list
    json.dump(list(values), fid)
If you need to process even bigger files, look for libraries that can parse a JSON file iteratively.