How to read multiple json files into pandas dataframe?

I'm having a hard time loading multiple line-delimited JSON files into a single pandas dataframe. This is the code I'm using:

import os, json
import pandas as pd
import numpy as np
import glob
pd.set_option('display.max_columns', None)

temp = pd.DataFrame()

path_to_json = '/Users/XXX/Desktop/Facebook Data/*' 

json_pattern = os.path.join(path_to_json,'*.json')
file_list = glob.glob(json_pattern)

for file in file_list:
    data = pd.read_json(file, lines=True)
    temp.append(data, ignore_index = True)

When I look through file_list, all the files appear to be found, but I cannot figure out how to get each file into the dataframe. There are about 50 files with a couple of lines in each file.

asked Jul 17 '19 02:07

onetap

4 Answers

Change the last line to:

temp = temp.append(data, ignore_index = True)

This is necessary because append does not work in place: it leaves temp unchanged and returns a new data frame containing the result, so that result has to be assigned back to temp.
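A minimal sketch of that behaviour, written with pd.concat so that it also runs on pandas versions that no longer ship DataFrame.append. Like append, pd.concat returns a new frame and leaves its inputs untouched. The frames and column name here are made up for illustration.

```python
import pandas as pd

temp = pd.DataFrame({"id": [0]})     # hypothetical accumulator with one row
data = pd.DataFrame({"id": [1, 2]})  # hypothetical stand-in for one loaded file

# The return value is discarded here, so temp keeps its single row.
pd.concat([temp, data], ignore_index=True)
rows_without_assignment = len(temp)

# Assigning the result back is what actually grows temp.
temp = pd.concat([temp, data], ignore_index=True)
rows_with_assignment = len(temp)

print(rows_without_assignment, rows_with_assignment)
```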

Edit:

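Since writing this answer I have learned that you should never use DataFrame.append inside a loop, because every call copies all the rows accumulated so far, which leads to quadratic copying (see this answer). DataFrame.append was also deprecated in pandas 1.4 and removed in pandas 2.0.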

What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation. Like this:

dfs = [] # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True) # read data frame from json file
    dfs.append(data) # append the data frame to the list

temp = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.

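This alternative should be considerably faster when there are many files, because the rows are copied once at the end instead of once per loop iteration.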
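For anyone who wants to try the list-then-concat pattern without real data, here is a self-contained sketch. The temporary directory, file names, and records are invented for the example; only pd.read_json(..., lines=True) and pd.concat come from the answer itself.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data: two line-delimited JSON files in a temp directory.
tmp_dir = Path(tempfile.mkdtemp())
samples = {
    "a.json": [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}],
    "b.json": [{"id": 3, "name": "z"}],
}
for filename, records in samples.items():
    lines = "\n".join(json.dumps(record) for record in records)
    (tmp_dir / filename).write_text(lines + "\n")

# Sorted so the row order does not depend on the file system.
file_list = sorted(tmp_dir.glob("*.json"))

# Read every file first, then concatenate once at the end.
dfs = [pd.read_json(path, lines=True) for path in file_list]
combined = pd.concat(dfs, ignore_index=True)

print(combined)
```

With ignore_index=True the combined frame gets a fresh 0..n-1 index instead of repeating each file's own row numbers.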

answered Oct 11 '22 18:10

Juan Estevez

If you need to flatten nested JSON, Juan Estevez's approach won't work as is. Here is an alternative:

import json  # needed for json.loads below
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)
df = pd.concat(dfs, sort=False) # or sort=True depending on your needs

Or, if your JSON files are line-delimited (not tested):

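import json  # needed for json.loads below
import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f:  # iterate over the file directly instead of readlines()
            if line.strip():  # skip blank lines, which json.loads would reject
                json_data = pd.json_normalize(json.loads(line))
                dfs.append(json_data)
df = pd.concat(dfs, sort=False) # or sort=True depending on your needs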
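To illustrate what "flatten" means here, the following sketch runs pd.json_normalize on two invented records that contain a nested "user" object. The records and field names are assumptions for the example, not taken from the question's data.

```python
import json

import pandas as pd

# Hypothetical line-delimited records with a nested "user" object.
raw_lines = [
    '{"id": 1, "user": {"name": "ann", "city": "Oslo"}}',
    '{"id": 2, "user": {"name": "bob", "city": "Lima"}}',
]

# Parse each line, then flatten all records in one call.
# Nested keys become dotted column names such as "user.name".
records = [json.loads(line) for line in raw_lines]
flat = pd.json_normalize(records)

print(list(flat.columns))
print(flat)
```

Passing the whole list of records to json_normalize in one call also avoids building a separate one-row data frame per line, which may matter for large files.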

answered Oct 11 '22 17:10

Skippy le Grand Gourou

I combined Juan Estevez's answer with glob. Thanks a lot.

import pandas as pd
import glob

def readFiles(path):
    files = glob.glob(path)
    dfs = [] # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True) # read data frame from json file
        dfs.append(data) # append the data frame to the list

    df = pd.concat(dfs, ignore_index=True) # concatenate all the data frames in the list.
    return df
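One thing to watch with a helper like this: pd.concat raises a ValueError when the list is empty, which happens whenever the glob pattern matches nothing. A possible guard is sketched below; read_files_or_empty is a hypothetical name, and returning an empty data frame is only one of several reasonable choices.

```python
import glob
import os
import tempfile

import pandas as pd


def read_files_or_empty(pattern):
    """Hypothetical variant of readFiles that tolerates a pattern with no matches."""
    files = sorted(glob.glob(pattern))  # sorted for a reproducible row order
    if not files:
        # pd.concat would raise ValueError on an empty list, so return early.
        return pd.DataFrame()
    dfs = [pd.read_json(file, lines=True) for file in files]
    return pd.concat(dfs, ignore_index=True)


# A freshly created temp directory contains no JSON files, so nothing matches.
empty_dir = tempfile.mkdtemp()
result = read_files_or_empty(os.path.join(empty_dir, "*.json"))
print(result.empty)
```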

answered Oct 11 '22 19:10

Ekin Gün Öncü

from pathlib import Path
import pandas as pd

paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
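This one-liner appears to assume that each file holds a single JSON object rather than line-delimited records: typ="series" reads the object as a pandas Series, and building a data frame from a list of Series turns each one into a row. A small sketch under that assumption, with invented files:

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical input: each file contains exactly one JSON object.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "one.json").write_text(json.dumps({"id": 1, "name": "x"}))
(tmp_dir / "two.json").write_text(json.dumps({"id": 2, "name": "y"}))

paths = sorted(tmp_dir.glob("*.json"))

# Each file becomes one Series; each Series becomes one row of the frame.
rows = [pd.read_json(p, typ="series") for p in paths]
df = pd.DataFrame(rows)

print(df)
```

For files with several records per file, as in the question, read_json(..., lines=True) followed by pd.concat is likely the better fit.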

answered Oct 11 '22 19:10

0-_-0

Tags: python, json, pandas, dataframe