I finally have the output I need from a file with many JSON objects, but I need some help converting the output below into a single DataFrame as it loops through the data. Here is the code that produces the output, including a sample of what the output looks like:
original data:
{
"zipcode":"08989",
"current"{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37,"wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}
{
"zipcode":"08990",
"current":{"canwc":null,"cig":4900,"class":"observation","clds":"OVC","day_ind":"D","dewpt":19,"expireTimeGMT":1385486700,"feels_like":34,"gust":null,"hi":37,"humidex":null,"icon_code":26,"icon_extd":2600,"max_temp":37, "wxMan":"wx1111"},
"triggers":[53,31,9,21,48,7,40,178,55,179,176,26,103,175,33,51,20,57,112,30,50,113]
}
import glob
import itertools
import json
from itertools import chain

import pandas as pd

def lines_per_n(f, n):
    for line in f:
        yield ''.join(chain([line], itertools.islice(f, n - 1)))

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 5):
            try:
                jfile = json.loads(chunk)
                zipcode = jfile['zipcode']
                datetime = jfile['current']['proc_time']
                triggers = jfile['triggers']
                print pd.Series(jfile['zipcode']),\
                    pd.Series(jfile['current']['proc_time']),\
                    jfile['triggers']
            except ValueError, e:
                pass
            else:
                pass
Sample output I get when I run the above, which I would like to store in a pandas DataFrame as 3 columns:
08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
08988 20131126102946 []
08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]
00544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48]
So the below code seems a lot closer, in that it gives me a funky df if I pass in the list and transpose the df. Any idea how I can get this reshaped properly?
def series_chunk(chunk):
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return jfile['zipcode'],\
        jfile['current']['proc_time'],\
        jfile['triggers']

for fin in glob.glob('*.txt'):
    with open(fin) as f:
        for chunk in lines_per_n(f, 7):
            df1 = pd.DataFrame(list(series_chunk(chunk)))
            print df1.T
[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]
[u'08988', u'20131126102946', []]
[u'08989', u'20131126102946', [53, 31, 9, 21, 48, 7, 40, 178, 55, 179]]
Dataframe:
0 1 2
0 08988 20131126102946 []
0 1 2
0 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
0 1 2
0 08988 20131126102946 []
0 1 2
0 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
Here is my final code and output. How do I capture each dataframe it creates through the loop and concatenate them on the fly as one dataframe object?
for fin in glob.glob('*.txt'):
    with open(fin) as f:
        print pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 7)], axis=1).T
0 1 2
0 08988 20131126102946 []
1 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
0 1 2
0 08988 20131126102946 []
1 08989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
You can use the read_json function:

# can either pass a string of the json, or a filepath to a file with valid json
In [99]: pd.read_json('[{"A": 1, "B": 2}, {"A": 3, "B": 4}]')
Out[99]:
A B
0 1 2
1 3 4
Check out the IO part of the docs for several examples, arguments you can pass to this function, as well as ways to normalize less structured json.
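For nested objects like the "current" dict above, json_normalize (available as pd.json_normalize in recent pandas versions) can flatten them into dotted columns. A minimal sketch, using a record shaped like the data above:

```python
import pandas as pd

# Flatten the nested "current" dict into dotted columns; the record
# below mimics the structure of the data above.
records = [
    {"zipcode": "08989",
     "current": {"proc_time": "20131126102946", "feels_like": 34},
     "triggers": [53, 31]},
]

df = pd.json_normalize(records)
# columns: zipcode, triggers, current.proc_time, current.feels_like
```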
If you don't have valid json, it's often easier to munge the string into valid json before reading it in; for an example, see this answer.
If you have several json files, you should concat the DataFrames together (similar to this answer):
pd.concat([pd.read_json(file) for file in ...], ignore_index=True)
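A self-contained sketch of that pattern, writing two small files to a temp directory first (the file names and contents here are purely illustrative):

```python
import glob
import json
import os
import tempfile

import pandas as pd

# Write two small valid-json files to a temp dir, then read them back
# and concat the resulting DataFrames into one.
tmp = tempfile.mkdtemp()
for name, rows in [('a.json', [{"A": 1, "B": 2}]),
                   ('b.json', [{"A": 3, "B": 4}])]:
    with open(os.path.join(tmp, name), 'w') as fh:
        json.dump(rows, fh)

paths = sorted(glob.glob(os.path.join(tmp, '*.json')))
df = pd.concat([pd.read_json(p) for p in paths], ignore_index=True)
```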
Use a lookbehind in the regex for the separator passed to read_csv:
In [11]: df = pd.read_csv('foo.csv', sep='(?<!,)\s', header=None)
In [12]: df
Out[12]:
0 1 2
0 8988 20131126102946 []
1 8989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
2 8988 20131126102946 []
3 8989 20131126102946 [53, 31, 9, 21, 48, 7, 40, 178, 55, 179, 176, ...
4 544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
5 601 20131126094911 []
6 602 20131126101056 []
7 603 20131126101056 []
8 604 20131126101056 []
9 544 20131126102946 [178, 30, 176, 103, 179, 112, 21, 20, 48, 7, 5...
10 601 20131126094911 []
11 602 20131126101056 []
12 603 20131126101056 []
13 604 20131126101056 []
[14 rows x 3 columns]
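To see what that lookbehind separator is doing, you can apply the same pattern with re.split on a single line: it splits on whitespace not preceded by a comma, so the spaces inside the bracketed trigger lists survive:

```python
import re

# Split on whitespace NOT preceded by a comma; the spaces inside the
# trigger list are left intact, yielding exactly three fields.
line = '08989 20131126102946 [53, 31, 9]'
fields = re.split(r'(?<!,)\s', line)
# -> ['08989', '20131126102946', '[53, 31, 9]']
```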
As mentioned in the comments, you may be able to do this more directly by concatenating several Series together. It's also a little easier to follow:
def series_chunk(chunk):
    jfile = json.loads(chunk)
    zipcode = jfile['zipcode']
    datetime = jfile['current']['proc_time']
    triggers = jfile['triggers']
    return pd.Series([jfile['zipcode'], jfile['current']['proc_time'], jfile['triggers']])

dfs = []
for fin in glob.glob('*.txt'):
    with open(fin) as f:
        df = pd.concat([series_chunk(chunk) for chunk in lines_per_n(f, 5)], axis=1).T
        dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
Note: You can also move the try/except into series_chunk.
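A sketch of that variant, with the try/except moved inside series_chunk so malformed chunks are skipped at the source (the sample chunks here are illustrative):

```python
import json

import pandas as pd

def series_chunk(chunk):
    # Swallow malformed chunks here instead of in the calling loop.
    try:
        jfile = json.loads(chunk)
        return pd.Series([jfile['zipcode'],
                          jfile['current']['proc_time'],
                          jfile['triggers']])
    except (ValueError, KeyError):
        return None

chunks = [
    '{"zipcode": "08989", "current": {"proc_time": "20131126102946"}, "triggers": [53, 31]}',
    'not valid json',
]
# Filter out the Nones produced by bad chunks before concatenating.
series = [s for s in (series_chunk(c) for c in chunks) if s is not None]
df = pd.concat(series, axis=1).T
```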