Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Segmentation Fault when using json.loads -- alternative way to load JSON into a list?

Tags:

python

json

I am trying to load some JSON Twitter data into a list, but instead I'm getting segmemtation fault (core dumped).

While I would love to upgrade my memory, that simply isn't an option right now. I would like to know if there is some way to maybe iterate over this list instead of trying to trying to load it all into memory? Or maybe there is a different kind of solution to this problem that will allow me to load this JSON data into a list?

In [1]: import json

In [2]: data = []

In [3]: for i in open('tweets.json'):
   ...:     try:
   ...:         data.append(json.loads(i))
   ...:     except:
   ...:         pass
   ...:     

Segmentation fault (core dumped)

The data was collected using the Twitter Streaming API over about 10 days and is 213M in size.

Machine Specs:

  • Oracle VM Virtual Box
  • Operating System: Ubuntu (64 bit)
  • Base Memory: 1024 MB
  • Video Memory: 128 MB
  • Storage (Virtual Size): 8.00 GB Dynamically allocated

I'm using iPython (version 2.7.6), and accessing it through a Linux terminal window.

like image 486
CurtLH Avatar asked Nov 10 '22 06:11

CurtLH


1 Answers

On almost any modern machine, a 213MB file is very tiny and easily fits into memory. I've loaded larger tweet datasets into memory on average modern machines. But perhaps you (or someone else reading this later) aren't working on a modern machine, or it is a modern machine with an especially small memory capacity.

If it is indeed the size of the data causing the segmentation fault, then you may try the ijson module for iterating over chunks of the JSON document.

Here's an example from that project's page:

import ijson

parser = ijson.parse(urlopen('http://.../'))
stream.write('<geo>')
for prefix, event, value in parser:
    if (prefix, event) == ('earth', 'map_key'):
        stream.write('<%s>' % value)
        continent = value
    elif prefix.endswith('.name'):
        stream.write('<object name="%s"/>' % value)
    elif (prefix, event) == ('earth.%s' % continent, 'end_map'):
        stream.write('</%s>' % continent)
stream.write('</geo>')
like image 67
ely Avatar answered Nov 13 '22 10:11

ely