Read only specific fields from large JSON and import into a Pandas Dataframe

I have a folder with roughly 10 JSON files, each between 500 and 1000 MB. Each file contains about 1,000,000 lines like the following:

{
    "dateTime": "2019-01-10 01:01:000.0000",
    "cat": 2,
    "description": "This description",
    "mail": "[email protected]",
    "decision": [{"first": "01", "second": "02", "third": "03"}, {"first": "04", "second": "05", "third": "06"}],
    "Field001": "data001",
    "Field002": "data002",
    "Field003": "data003",
    ...
    "Field999": "data999"
}

My goal is to analyze the data with pandas, so I would like to load the data from all the files into a single DataFrame. If I loop over all the files, Python crashes because I don't have enough memory to hold everything.

For my purposes I only need a DataFrame with two columns, cat and dateTime, from all the files, which I assume is much lighter than a DataFrame with every column. I have tried to read only these two fields with the following snippet:

Note: at the moment I am working with only one file; once I have a fast reading routine I will loop over all the other files (A.json, B.json, ...).

import pandas as pd
import json
import os.path
from glob import glob   # will be used later to loop over all the files

cols = ['cat', 'dateTime']
df = pd.DataFrame(columns=cols)

file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        # Parse one JSON record and keep only the two fields I need
        data = json.loads(line)
        row = {'cat': data['cat'], 'dateTime': data['dateTime']}
        # Append the row to the DataFrame one record at a time
        df = df.append(row, ignore_index=True)

The code works, but it is very slow: it takes more than one hour for a single file, whereas reading a whole file and storing it in a DataFrame usually takes me 8-10 minutes.

Is there a faster way to read only two specific fields and append them to a DataFrame?

I have tried reading the whole JSON file into a DataFrame and then dropping all the columns except 'cat' and 'dateTime', but that seems to be too heavy for my MacBook.

asked Jan 10 '19 by Nicolaesse

1 Answer

I had the same problem. I found out that appending a dict to a DataFrame row by row is very slow. Collect the values in a plain list instead and build the DataFrame once at the end. In my case it took 14 s instead of 2 h.

import pandas as pd
import json

cols = ['cat', 'dateTime']
data = []
file_name = 'this_is_my_path/File_A.json'

with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        # Keep only the two fields as a plain list; far cheaper than
        # appending to the DataFrame on every iteration
        lst = [doc['cat'], doc['dateTime']]
        data.append(lst)

# Build the DataFrame once, from all the collected rows
df = pd.DataFrame(data=data, columns=cols)
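
To extend this to the whole folder (as the question plans to do), a minimal sketch could look like the following; the this_is_my_path/*.json pattern and the latin-1 encoding are assumptions carried over from the question.

import json
from glob import glob

import pandas as pd

cols = ['cat', 'dateTime']
data = []

# Loop over every JSON-lines file in the folder
# (the path pattern is an assumption based on the question)
for file_name in glob('this_is_my_path/*.json'):
    with open(file_name, encoding='latin-1') as f:
        for line in f:
            doc = json.loads(line)
            data.append([doc['cat'], doc['dateTime']])

# One DataFrame holding the two fields from all files
df = pd.DataFrame(data=data, columns=cols)

Since only the two small fields are kept per line, the memory footprint stays far below loading the full records.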
answered Sep 30 '22 by Mozak