 

How do I convert this JSON-like format into a format that pandas read_json() can read?

This is my first time asking a question on Stack Overflow. My English is poor, so if I accidentally offend you with my wording, please don't mind.

I have a JSON file (access.json) formatted like this:

[
{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1', ..... },
{u'IP': u'aaaa2', u'Domain': u'bbbb2', u'Time': u'cccc2', ..... },
{u'IP': u'aaaa3', u'Domain': u'bbbb3', u'Time': u'cccc3', ..... },
{u'IP': u'aaaa4', u'Domain': u'bbbb4', u'Time': u'cccc4', ..... },
{ ....... }, 
{ ....... } 
]

When I run the following in ipython:

import pandas as pd
data = pd.read_json('./access.json')

it returns:

ValueError: Expected object or value

This is the result I want:

[out]
       IP    Domain     Time    ...
0   aaaa1     bbbb1    cccc1    ...
1   aaaa2     bbbb2    cccc2    ...
2   aaaa3     bbbb3    cccc3    ...
3   aaaa4     bbbb4    cccc4    ...
...and so on

How can I achieve this? Thank you for your answers!

Yu-Slang Chen asked Jun 09 '14 03:06



2 Answers

This isn't valid JSON, which is why read_json won't parse it. JSON requires double-quoted strings and has no u prefix.

{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1', ..... },

should be

{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1", ..... },
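To see the difference concretely, here is a small sketch using only the standard-library json module. The two sample strings are copied from the lines above with the elided fields dropped, and is_valid_json is a hypothetical helper name:

```python
import json

# Sample lines from above, without the elided fields.
repr_style = "{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'}"
json_style = '{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1"}'


def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise (hypothetical helper)."""
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


print(is_valid_json(repr_style))  # False
print(is_valid_json(json_style))  # True
```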

You could smash this (the entire file) with a regular expression to find these, for example:

In [11]: line
Out[11]: "{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'},"

In [12]: re.sub("(?<=[\{ ,])u'|'(?=[:,\}])", '"', line)
Out[12]: '{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1"},'

Note: this will get tripped up by some strings, so use with caution.
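As a rough sketch of applying the same substitution to a whole file's text, the following uses the pattern from above written as a raw string, then checks the result with json.loads. The helper name repr_to_json_text and the two-record sample are assumptions for illustration, not part of the original file:

```python
import json
import re

# Same pattern as in the session above, written as a raw string.
PATTERN = re.compile(r"(?<=[\{ ,])u'|'(?=[:,\}])")


def repr_to_json_text(text):
    """Swap u'...' and '...' delimiters for double quotes (hypothetical helper)."""
    return PATTERN.sub('"', text)


# Stand-in for the contents of access.json, with the elided fields dropped.
sample = """[
{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'},
{u'IP': u'aaaa2', u'Domain': u'bbbb2', u'Time': u'cccc2'}
]"""

fixed = repr_to_json_text(sample)
records = json.loads(fixed)  # raises ValueError if the text is still not JSON
print(records[0]["IP"])  # aaaa1
```

If the json.loads check passes, the fixed text can be handed to pandas, for example pd.read_json(io.StringIO(fixed)). The check is there because the regex can still be tripped up by values that contain quotes followed by a comma, colon, or brace.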

A better "solution" would be to ensure you had valid json in the first place... It looks like this has come from python's str/unicode/repr rather than json.dumps.

Note: json.dumps produces valid json, so can be read by read_json.

In [21]: repr({u'IP': u'aaa'})
Out[21]: "{u'IP': u'aaa'}"

In [22]: json.dumps({u'IP': u'aaa'})
Out[22]: '{"IP": "aaa"}'

If someone else created this "json", then complain! It's not json.
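If you control the code that writes the file, a minimal sketch of the producer side could look like this. The records and the temporary output path are made up for illustration; the point is only that json.dump is used instead of writing str() or repr() of the list:

```python
import json
import os
import tempfile

# Illustrative records; the real ones would come from wherever the log data lives.
records = [
    {"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1"},
    {"IP": "aaaa2", "Domain": "bbbb2", "Time": "cccc2"},
]

# Hypothetical output path; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "access.json")

with open(path, "w") as f:
    json.dump(records, f)  # writes double-quoted JSON, not a Python repr

with open(path) as f:
    text = f.read()

print(text)
```

A file written this way should be readable with pd.read_json(path).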

Andy Hayden answered Oct 18 '22 04:10



This is not JSON. It is a list of dictionaries written as Python literals. You can use ast.literal_eval() to get the actual list from the file and pass it to the DataFrame constructor:

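from ast import literal_eval
import pandas as pd

# Parse the file contents as a Python literal (a list of dicts)
with open('./access.json') as f:
    data = literal_eval(f.read())

# Each dict becomes one row; the keys become the columns
df = pd.DataFrame(data)
print(df)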

Output for the example data you've provided:

  Domain     IP   Time
0  bbbb1  aaaa1  cccc1
1  bbbb2  aaaa2  cccc2
2  bbbb3  aaaa3  cccc3
3  bbbb4  aaaa4  cccc4
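If the goal is to end up with a file that read_json() itself accepts, the same literal_eval step can be combined with json.dumps to convert the text once. This is only a sketch under the assumption that the file contains nothing but literals; repr_text_to_json_text is a hypothetical helper name and the one-record sample stands in for the real file contents:

```python
import json
from ast import literal_eval


def repr_text_to_json_text(text):
    """Parse a Python-literal list of dicts and re-serialize it as JSON.

    Hypothetical helper. literal_eval only evaluates literals, so it does
    not run arbitrary code found in the file.
    """
    data = literal_eval(text)
    return json.dumps(data)


# Stand-in for the file contents, with the elided fields dropped.
sample = "[{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'}]"
print(repr_text_to_json_text(sample))
# [{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1"}]
```

Writing the returned text back to disk gives a file that is real JSON, so later reads no longer depend on literal_eval.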
alecxe answered Oct 18 '22 04:10
