I have a folder with roughly 10 JSON files, each between 500 and 1000 MB in size. Each file contains about 1,000,000 lines like the following:
{
"dateTime": '2019-01-10 01:01:000.0000'
"cat": 2
"description": 'This description'
"mail": '[email protected]'
"decision":[{"first":"01", "second":"02", "third":"03"},{"first":"04", "second":"05", "third":"06"}]
"Field001": 'data001'
"Field002": 'data002'
"Field003": 'data003'
...
"Field999": 'data999'
}
My goal is to analyze the data with pandas, so I would like to load the data from all the files into a DataFrame. If I loop over all the files, Python crashes because I don't have enough free memory to hold the data.
For my purpose I only need a DataFrame with two columns, cat and dateTime, from all the files, which I suppose is lighter than a whole DataFrame with all the columns. I have tried to read only these two columns with the following snippet:
Note: at the moment I am working with only one file; once I have a fast reader I will loop over all the other files (A.json, B.json, ...).
import pandas as pd
import json
import os.path
from glob import glob

cols = ['cat', 'dateTime']
df = pd.DataFrame(columns=cols)
file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        data = json.loads(line)
        lst_dict = {'cat': data['cat'], 'dateTime': data['dateTime']}
        df = df.append(lst_dict, ignore_index=True)
The code works, but it is very slow: it takes more than one hour for a single file, whereas reading the whole file and storing it in a DataFrame usually takes me 8-10 minutes.
Is there a way to read only two specific columns and append to a Dataframe in a faster way?
I have also tried reading the whole JSON file into a DataFrame and then dropping every column except 'cat' and 'dateTime', but that seems to be too heavy for my MacBook.
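I had the same problem. I found out that appending a dict to a DataFrame is very slow, because every append returns a new DataFrame and copies the rows collected so far. Collect the values in a plain list instead and build the DataFrame once at the end. In my case it took 14 s instead of 2 h.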
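import json
import pandas as pd

cols = ['cat', 'dateTime']
data = []
file_name = 'this_is_my_path/File_A.json'
with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        # Keep only the two needed values; the rest of the record is discarded.
        lst = [doc['cat'], doc['dateTime']]
        data.append(lst)
# Build the DataFrame once, after all lines have been read.
df = pd.DataFrame(data=data, columns=cols)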
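Since the question mentions looping over the other files (A.json, B.json, ...) afterwards, here is a rough sketch of how the same list-based approach could be extended to a whole folder. The helper names `read_two_columns` and `load_folder` and the glob pattern are my own placeholders, not something from the question, and I am assuming every file holds one JSON object per line with both keys present.

```python
import json
from glob import glob

import pandas as pd

COLS = ['cat', 'dateTime']


def read_two_columns(file_name, encoding='latin-1'):
    """Yield [cat, dateTime] for each JSON line of one file (hypothetical helper)."""
    with open(file_name, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            doc = json.loads(line)
            yield [doc['cat'], doc['dateTime']]


def load_folder(pattern):
    """Build one two-column DataFrame from every file matching the glob pattern."""
    frames = [
        # One small DataFrame per file, built in a single step from a list.
        pd.DataFrame(list(read_two_columns(name)), columns=COLS)
        for name in sorted(glob(pattern))
    ]
    if not frames:
        return pd.DataFrame(columns=COLS)
    # A single concat at the end avoids growing a DataFrame row by row.
    return pd.concat(frames, ignore_index=True)


# Example call (the path is an assumption based on the question):
# df = load_folder('this_is_my_path/*.json')
```

Only the two extracted values per line are kept in memory, so the full 999-field records never accumulate, and the per-file DataFrames are combined once with `pd.concat` rather than appended to repeatedly.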