I am running a script (in multiprocessing mode) that extracts some parameters from a bunch of JSON files, but it is currently very slow. Here is the script:
from __future__ import print_function, division
import os
from glob import glob
from os import getpid
from time import time
from sys import stdout
import resource
from multiprocessing import Pool
import subprocess

try:
    import simplejson as json
except ImportError:
    import json

path = '/data/data//*.A.1'

print("Running with PID: %d" % getpid())

def process_file(file):
    start = time()
    filename = file.split('/')[-1]
    print(file)
    # Write one "ip|domain" line per rdata entry to a per-file output
    with open('/data/data/A.1/%s_DI' % filename, 'w') as w:
        with open(file, 'r') as f:
            for n, line in enumerate(f):
                d = json.loads(line)
                try:
                    domain = d['rrname']
                    ips = d['rdata']
                    for i in ips:
                        print("%s|%s" % (i, domain), file=w)
                except KeyError:  # skip records missing rrname/rdata
                    print(d)

if __name__ == "__main__":
    files_list = glob(path)
    cores = 12
    print("Using %d cores" % cores)
    pp = Pool(processes=cores)
    pp.imap_unordered(process_file, files_list)
    pp.close()
    pp.join()
Does anybody know how to speed this up?
Switch from
import json
to
import ujson
https://artem.krylysov.com/blog/2015/09/29/benchmark-python-json-libraries/
or switch to orjson
import orjson
https://github.com/ijl/orjson
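Note that orjson is not a complete drop-in replacement (for example, orjson.dumps() returns bytes rather than str), but this script only decodes, and orjson.loads() accepts either str or bytes. A minimal sketch of the decoding swap, assuming line-delimited JSON input (the file path here is hypothetical):

import orjson

# Reading in binary mode and passing raw bytes straight to
# orjson.loads() avoids an extra str decode per line.
with open('/data/data/sample.A.1', 'rb') as f:  # hypothetical input file
    for line in f:
        d = orjson.loads(line)
        print(d.get('rrname'))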
First, find out where your bottlenecks are.
If they are in the JSON decoding/encoding step, try switching to ultrajson:
UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.
The change would be as simple as updating the import block:
try:
    import ujson as json
except ImportError:
    try:
        import simplejson as json
    except ImportError:
        import json
I've also done a simple benchmark at What is faster - Loading a pickled dictionary object or Loading a JSON file - to a dictionary?; take a look.
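To locate the bottleneck in the first place, one option is to profile a single-file run with the standard library's cProfile before touching multiprocessing, since worker processes hide their costs from a top-level profiler. A minimal sketch, meant to run inside the script's main block so process_file is in scope; the sample input path is an assumption:

import cProfile
import pstats

# Profile one representative file in a single process.
# The path below is hypothetical; use one of your real input files.
cProfile.run("process_file('/data/data/sample.A.1')", 'profile.out')

stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)  # top 10 by cumulative time

If json.loads dominates the cumulative time, a faster JSON library will help; if the time is in file I/O or printing, it won't.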