
Fast convert JSON column into Pandas dataframe

I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract that into a pandas dataframe. The snippet below works fine but is fairly inefficient and takes forever when run against the whole database. Note that not all the items have the same attributes and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.io.json.json_normalize) \
       .pipe(lambda x: pd.concat(x.values))  # this returns a dataframe where each JSON key is a column
asked Dec 18 '16 by jodoox


People also ask

How do you convert a JSON object into a pandas DataFrame?

You can convert JSON to a pandas DataFrame by simply using read_json(). Just pass the JSON string to the function. It takes multiple parameters; in this case I am using orient, which specifies the format of the JSON string. This function is also used to read JSON files into a pandas DataFrame.
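A minimal sketch of that, with a made-up records-orient string (newer pandas versions prefer a file-like object over a bare string, hence the StringIO wrapper):

import pandas as pd
from io import StringIO

# Made-up JSON in 'records' orient; wrap in StringIO for newer pandas.
json_str = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
df = pd.read_json(StringIO(json_str), orient='records')
print(df)  # two rows, columns a and b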

Is pandas query faster than LOC?

It depends on the DataFrame's size: on a large DataFrame the query function seems more efficient than loc, while on a small one (2K records x 6 columns) loc seems much more efficient than query.
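Both select the same rows, so you can time them on your own data; a hypothetical toy frame:

import pandas as pd

# Hypothetical frame; wrap either selection in %timeit to compare on your data.
df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

via_query = df.query("a > 5000")    # often faster on large frames
via_loc = df.loc[df["a"] > 5000]    # often faster on small frames
assert via_query.equals(via_loc)    # identical results either way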

What is faster than pandas DataFrame?

Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
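A minimal Dask sketch under those assumptions (4 partitions standing in for the 4 cores; the frame and column are made up):

import pandas as pd
import dask.dataframe as dd

# Split a pandas frame into partitions that Dask can process in parallel.
pdf = pd.DataFrame({"x": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=4)

print(ddf["x"].mean().compute())  # .compute() triggers the parallel run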


2 Answers

json_normalize takes already-parsed JSON (a dict, or a list/Series of dicts), not raw JSON strings, so decode the column with json.loads first (in pandas 1.0+ the function lives at pd.json_normalize):

pd.io.json.json_normalize(df.data.apply(json.loads)) 

setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])
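To see why this suits the question's data, here is a hypothetical pair of records with differing keys and nesting; json_normalize flattens nested attributes into dotted column names and fills missing keys with NaN:

import json
import pandas as pd

# Hypothetical strings standing in for the 'data' column.
raw = ['{"a": 1, "b": {"c": 2}}', '{"a": 3, "d": 4}']

flat = pd.json_normalize([json.loads(x) for x in raw])
print(flat)  # columns a, b.c, d; keys absent from a record become NaN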
answered by piRSquared


I think you can first convert the string column data to dicts with json.loads, then build a list of the values, and finally use DataFrame.from_records:

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()

print(pd.DataFrame.from_records(a))
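One caveat: unlike json_normalize, DataFrame.from_records does not flatten nested objects, so a nested dict stays as a single cell value. A tiny illustration with made-up records:

import pandas as pd

# Made-up records: 'b' is nested in the first and missing from the second.
records = [{"a": 1, "b": {"c": 2}}, {"a": 3}]
print(pd.DataFrame.from_records(records))
# 'b' holds the dict {'c': 2} in row 0 and NaN in row 1; it is not expanded to 'b.c'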

Another idea:

df = pd.json_normalize(df['data'].apply(json.loads))
answered by jezrael