
Is there a memory efficient and fast way to load big JSON files?

I have some JSON files of 500 MB each. If I use the "trivial" json.load() to load their content all at once, it will consume a lot of memory.

Is there a way to read the file partially? If it were a text, line-delimited file, I would be able to iterate over the lines. I am looking for an analogy to that.

asked Mar 08 '10 by duduklein


2 Answers

There was a duplicate to this question that had a better answer. See https://stackoverflow.com/a/10382359/1623645, which suggests ijson.

Update:

I tried it out, and ijson is to JSON what SAX is to XML. For instance, you can do this:

    import ijson

    for prefix, the_type, value in ijson.parse(open(json_file_name)):
        print(prefix, the_type, value)

where prefix is a dot-separated index into the JSON tree (what happens if your key names contain dots? I guess that would be bad for JavaScript, too...), the_type describes a SAX-like event, one of 'null', 'boolean', 'number', 'string', 'map_key', 'start_map', 'end_map', 'start_array', 'end_array', and value is the value of the object, or None if the_type is an event like the start/end of a map or array.
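
To make the prefix/event scheme concrete, here is a hedged sketch of the events you would see for a tiny document (the exact value types can vary by ijson backend; numbers may arrive as Decimal):

    import io
    import ijson

    # Events emitted for {"a": [1, 2]} -- array elements show up under the
    # prefix "a.item", which is how ijson addresses the items of an array.
    for prefix, the_type, value in ijson.parse(io.BytesIO(b'{"a": [1, 2]}')):
        print(prefix, the_type, value)
    # ''       start_map    None
    # ''       map_key      a
    # 'a'      start_array  None
    # 'a.item' number       1    (possibly Decimal('1'), depending on backend)
    # 'a.item' number       2
    # 'a'      end_array    None
    # ''       end_map      None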

The project has some docstrings, but not enough global documentation. I had to dig into ijson/common.py to find what I was looking for.
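
If the file's top level is a large JSON array, ijson's items() helper is often more convenient than raw parse() events, since it yields one fully built element at a time. A minimal sketch, assuming a hypothetical big_file.json whose top level is an array (the 'item' prefix addresses top-level array elements) and a hypothetical process() callback:

    import ijson

    # Stream one element at a time from a top-level JSON array,
    # e.g. [{"id": 1, ...}, {"id": 2, ...}], without loading the whole file.
    with open("big_file.json", "rb") as f:        # assumed file name
        for record in ijson.items(f, "item"):     # "item" = each top-level array element
            process(record)                       # hypothetical per-record handler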

answered Sep 21 '22 by Jim Pivarski


So the problem is not that each file is too big, but that there are too many of them, and they seem to be adding up in memory. Python's garbage collector should be fine, unless you are keeping around references you don't need. It's hard to tell exactly what's happening without any further information, but some things you can try:

  1. Modularize your code. Do something like:

        for json_file in list_of_files:
            process_file(json_file)

    If you write process_file() in such a way that it doesn't rely on any global state, and doesn't change any global state, the garbage collector should be able to do its job.

  2. Deal with each file in a separate process. Instead of parsing all the JSON files at once, write a program that parses just one, and pass each one in from a shell script, or from another Python process that calls your script via subprocess.Popen (a minimal sketch follows this list). This is a little less elegant, but if nothing else works, it will ensure that you're not holding on to stale data from one file to the next.
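
A minimal sketch of the second approach, assuming a hypothetical parse_one.py script that parses the single file named on its command line:

    import subprocess
    import sys

    for json_file in list_of_files:
        # Parse each file in its own child process; whatever memory the parse
        # uses is released back to the OS when that process exits.
        proc = subprocess.Popen([sys.executable, "parse_one.py", json_file])
        proc.wait()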

Hope this helps.

answered Sep 22 '22 by jcdyer