Mongoexport exporting invalid json files

Question

I collected some tweets from the twitter API and stored it to mongodb, I tried exporting the data to a JSON file and didn't have any issues there, until I tried to make a python script to read the JSON and convert it to a csv. I get this traceback error with my code:

json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)

So, after digging around the internet I was pointed to check the actual JSON data in an online validator, which I did. This gave me the error of:

Multiple JSON root elements

from the site https://jsonformatter.curiousconcept.com/

Here are pictures of the 1st/2nd object beginning/end of the file: enter image description here

or a link to the data here

Now, the problem is, I haven't found anything on the internet of how to handle that error. I'm not sure if it's an error with the data I've collected, exported, or if I just don't know how to work with it.

My end game with these tweets is to make a network graph. I was looking at either Networkx or Gephi, which is why I'd like to get a csv file.

Mohammad Yusuf · Accepted Answer

Robert Moskal is right. If you can address the issue at source and use --jsonArray flag when you use mongoexport then it will make the problem easier i guess. If you can't address it at source then read the below points.
The code below will extract you the individual json objects from the given file and convert them to python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using csv module then I would say use unicodecsv module as it would handle the unicode data in your json objects.

import json

with open('path_to_your_json_file', 'rb') as infile:    
json_block = []        
for line in infile:            
    json_block.append(line)            
    if line.startswith('}'):                
        json_dict = json.loads(''.join(json_block))                         
        json_block = []
        print json_dict

If you want to convert it to CSV using pandas you can use the below code:

import json, pandas as pd

with open('path_to_your_json_file', 'rb') as infile:
    json_block = []
    dictlist=[]
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []

df = pd.DataFrame(jsonlist)
df.to_csv('out.csv',encoding='utf-8')

If you want to flatten out the json object you can use pandas.io.json.json_normalize() method.

jtlindsey · Answer

Elaborating on @MYGz suggestion to use --jsonArray

Your post doesn't show how you exported the data from mongo. If you use the following via the terminal, you will get valid json from mongodb:

mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json

Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.

The following: mongoexport --collection=somecollection --db=somedb --out=validfile.json...will NOT give you the results you are looking for because:

By default mongoexport writes data using one JSON document for every MongoDB document. Ref

aitorhh · Answer

A bit late reply, and I am not sure it was available the time this question was posted. Anyway, now there is a simple way to import the mongoexport json data as follows:

df = pd.read_json(filename, lines=True)

mongoexport provides each line as a json objects itself, instead of the whole file as json.

Mongoexport exporting invalid json files

Tags:

python

json

csv

mongodb

twitter

Graham Chapman

3 Answers

Mohammad Yusuf

jtlindsey

aitorhh

Recent Activity

Donate For Us

Mongoexport exporting invalid json files

Tags:

python

json

csv

mongodb

twitter

Graham Chapman

3 Answers

Mohammad Yusuf

jtlindsey

aitorhh

Related questions

Recent Activity

Donate For Us