OK, after nearly a week of research, I'm going to give SO a shot. I have a text file that looks as follows (showing 3 separate JSON objects as an example, but the file has 50K of these):
{
"zipcode":"00544",
"current":{"canwc":null,"cig":7000,"class":"observation"},
"triggers":[178,30,176,103,179,112,21,20,48,7,50,40,57]
}
{
"zipcode":"00601",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[12,23,34,28,100]
}
{
"zipcode":"00602",
"current":{"canwc":null,"cig":null,"class":"observation"},
"triggers":[13,85,43,101,38,31]
}
I know how to work with JSON objects using the Python json library, but I'm having a challenge with how to create 50 thousand different JSON objects from reading the file. (Perhaps I'm not even thinking about this correctly, but ultimately I need to deserialize them and load them into a database.) I've tried itertools, thinking that I need a generator, so I was able to use:
with open(file) as f:
    for line in itertools.islice(f, 0, 7):  # since every 7 lines is a json object
        jfile = json.load(line)
But the above obviously won't work, since it is not reading the 7 lines as a single JSON object, and I'm also not sure how to then iterate over the entire file and load the individual JSON objects.
The following would give me a list I can slice:
list(open(file))[:7]
Any help would be really appreciated.
UPDATE: Extremely close to what I need, and I think literally one step away, but still struggling a little with iteration. The following will finally get me an iterative printout of all of the dataframes, but how do I make it so that I can capture one giant dataframe with all of the pieces essentially concatenated? I could then export that final dataframe to CSV, etc. (Also, is there a better way to upload this result into a database rather than creating a giant dataframe first?) A sketch of one possible approach follows the code below.
import json
import itertools
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

def flatten(jfile):
    # iterate over a copy of the items so we can safely modify jfile
    for k, v in list(jfile.items()):
        if isinstance(v, list):
            jfile[k] = ','.join(str(item) for item in v)  # triggers are ints
        elif isinstance(v, dict):
            for kk, vv in v.items():
                jfile[kk] = vv
            del jfile[k]
    return jfile

with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            jfile = json.loads(chunk)
            pd.DataFrame(list(flatten(jfile).items()))  # one frame per object, not kept yet
        except ValueError:
            pass  # chunk was not valid JSON
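One possible way to get that single concatenated dataframe is to collect the flattened dicts in a list and build one DataFrame at the end (a sketch reusing the lines_per_n and flatten helpers above; the output file name and the commented-out to_sql call are just illustrative):
import json

import pandas as pd

records = []  # one flattened dict per JSON object

with open('deadzips.json') as f:
    for chunk in lines_per_n(f, 7):
        try:
            records.append(flatten(json.loads(chunk)))
        except ValueError:
            pass  # skip chunks that are not valid JSON

# build a single dataframe from all records in one call
df = pd.DataFrame(records)
df.to_csv('deadzips.csv', index=False)

# or push it straight into a database table via SQLAlchemy
# (the engine and the table name 'zips' are placeholders):
# df.to_sql('zips', engine, if_exists='replace', index=False)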
Load 6 extra lines instead, and pass the string to json.loads():
with open(file) as f:
    for line in f:
        # slice the next 6 lines from the iterable, as a list.
        lines = [line] + list(itertools.islice(f, 6))
        jfile = json.loads(''.join(lines))
        # do something with jfile
json.load() will slurp up more than just the next object in the file, and islice(f, 0, 7) would read only the first 7 lines, rather than reading the file in 7-line blocks.
You can wrap reading a file in blocks of size N in a generator:
from itertools import islice, chain

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], islice(f, n - 1)))
then use that to chunk up your input file:
with open(file) as f:
    for chunk in lines_per_n(f, 7):
        jfile = json.loads(chunk)
        # do something with jfile
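To see what the generator actually yields, here is a small self-contained check using an in-memory io.StringIO in place of a real file (the sample text and the chunk size of 2 are arbitrary choices for illustration):
import io
from itertools import islice, chain

def lines_per_n(f, n):
    # same generator as above, repeated so this snippet runs on its own
    for line in f:
        yield ''.join(chain([line], islice(f, n - 1)))

sample = io.StringIO("a\nb\nc\nd\ne\n")
for chunk in lines_per_n(sample, 2):
    print(repr(chunk))
# prints 'a\nb\n', then 'c\nd\n', then 'e\n' -- each chunk is up to n consecutive lines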
Alternatively, if your blocks turn out to be of variable length, read until you have something that parses:
with open(file) as f:
    for line in f:
        while True:
            try:
                jfile = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # do something with jfile
As stated elsewhere, a general solution is to read the file in pieces, append each piece to the last, and try to parse that new chunk. If it doesn't parse, continue until you get something that does. Once you have something that parses, return it, and restart the process. Rinse-lather-repeat until you run out of data.
Here is a succinct generator that will do this:
def load_json_multiple(segments):
    chunk = ""
    for segment in segments:
        chunk += segment
        try:
            yield json.loads(chunk)
            chunk = ""
        except ValueError:
            # not a complete JSON value yet; keep accumulating lines
            pass
Use it like this:
with open('foo.json') as f:
    for parsed_json in load_json_multiple(f):
        print(parsed_json)
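Since the question also mentions ultimately loading everything into a database, here is a rough sketch (not part of the original answer) that inserts each parsed object into a local SQLite table as it is parsed, instead of building one giant dataframe first; the database and table names, and the column layout mirroring the sample data, are just placeholders:
import json
import sqlite3

conn = sqlite3.connect('zips.db')
conn.execute("""
    CREATE TABLE IF NOT EXISTS zips (
        zipcode  TEXT,
        canwc    TEXT,
        cig      INTEGER,
        class    TEXT,
        triggers TEXT
    )
""")

with open('foo.json') as f:
    for obj in load_json_multiple(f):
        cur = obj.get('current') or {}
        conn.execute(
            "INSERT INTO zips VALUES (?, ?, ?, ?, ?)",
            (
                obj.get('zipcode'),
                cur.get('canwc'),
                cur.get('cig'),
                cur.get('class'),
                ','.join(str(t) for t in obj.get('triggers', [])),
            ),
        )

conn.commit()
conn.close()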
I hope this helps.