Reading a huge number of JSON files in Python?

This is not about reading large JSON files; it's about reading a large number of JSON files in the most efficient way.

Question

I am working with the last.fm dataset from the Million Song Dataset. The data is available as a set of JSON-encoded text files where the keys are: track_id, artist, title, timestamp, similars and tags.

Currently I'm reading them into pandas in the following way; after going through a few options, this was the fastest, as shown here:

import os
import pandas as pd
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json


# Path to the dataset
path = "../lastfm_train/"

# Getting list of all json files in dataset
all_files = [os.path.join(root, file) for root, dirs, files in os.walk(path) for file in files if file.endswith('.json')]

# Parse every file and build a DataFrame indexed by track_id
data_list = [json.load(open(file)) for file in all_files]
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)

The current method reads a subset (1% of the full dataset) in less than a second. However, reading the full train set is too slow and takes forever (I have waited for a couple of hours as well), and it has become a bottleneck for further tasks such as the one shown in the question here.

I'm also using ujson to speed up parsing of the JSON files, as can be seen from this question here.

UPDATE 1: Using a generator expression instead of a list comprehension.

data_list = (json.load(open(file)) for file in all_files)
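For reference, a minimal sketch of how the generator could be consumed while also making sure each file is closed (the load_one helper is only an illustrative name, not part of the original code). Note that building the DataFrame still forces every file to be parsed, so the total parsing work is unchanged:

def load_one(path):
    # Parse a single JSON file and close the handle immediately afterwards
    with open(path) as f:
        return json.load(f)

data_gen = (load_one(file) for file in all_files)
# Materializing the generator here parses all files; the generator itself only defers the work
df = pd.DataFrame(list(data_gen), columns=['similars', 'track_id'])
df.set_index('track_id', inplace=True)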
Asked Jan 13 '17 by TJain

1 Answer

If you need to read and write the dataset multiple times, you could try converting the .json files into a faster binary format. For example, in pandas 0.20+ you could use the .feather format.
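A minimal sketch of that idea, assuming the data_list and column layout from the question, that pyarrow (or feather-format) is installed, and that the columns serialize cleanly to feather; the file name lastfm_train.feather is only an example:

import pandas as pd

# One-off conversion: parse the JSON files once, then persist the frame to feather
df = pd.DataFrame(data_list, columns=['similars', 'track_id'])
df.to_feather('lastfm_train.feather')   # to_feather expects a default RangeIndex

# Later runs: load the binary file instead of re-parsing thousands of .json files
df = pd.read_feather('lastfm_train.feather')
df.set_index('track_id', inplace=True)

Reading the feather file back is typically much faster than re-parsing the individual JSON files on every run.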

Answered Oct 30 '22 by tom