
Fast convert JSON column into Pandas dataframe

I'm reading data from a database (50k+ rows) where one column is stored as JSON. I want to extract that into a pandas dataframe. The snippet below works fine but is fairly inefficient and takes forever when run against the whole database. Note that not all the items have the same attributes and that the JSON has some nested attributes.

How could I make this faster?

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

df.data.apply(json.loads) \
       .apply(pd.io.json.json_normalize) \
       .pipe(lambda x: pd.concat(x.values))  # this returns a dataframe where each JSON key is a column
asked Dec 18 '16 by jodoox


People also ask

How do you convert a JSON object into a pandas DataFrame?

You can convert JSON to a pandas DataFrame by simply using read_json(). Just pass the JSON string to the function. It takes multiple parameters; in this case I am using orient, which specifies the format of the JSON string. This function is also used to read JSON files into a pandas DataFrame.
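A minimal sketch of that, with a made-up records-orient string (newer pandas versions prefer a file-like object over a bare string, hence the StringIO wrapper):

import pandas as pd
from io import StringIO

# Made-up JSON in 'records' orient; wrap in StringIO for newer pandas.
json_str = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'
df = pd.read_json(StringIO(json_str), orient='records')
print(df)  # two rows, columns a and b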

Is pandas query faster than LOC?

It depends on the DataFrame's size: on a large DataFrame the query function seems more efficient than loc, while on a small one (2K records x 6 columns) loc seems much more efficient than query.
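Both select the same rows, so you can time them on your own data; a hypothetical toy frame:

import pandas as pd

# Hypothetical frame; wrap either selection in %timeit to compare on your data.
df = pd.DataFrame({"a": range(10_000), "b": range(10_000)})

via_query = df.query("a > 5000")    # often faster on large frames
via_loc = df.loc[df["a"] > 5000]    # often faster on small frames
assert via_query.equals(via_loc)    # identical results either way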

What is faster than pandas DataFrame?

Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
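A minimal Dask sketch under those assumptions (4 partitions standing in for the 4 cores; the frame and column are made up):

import pandas as pd
import dask.dataframe as dd

# Split a pandas frame into partitions that Dask can process in parallel.
pdf = pd.DataFrame({"x": range(1_000_000)})
ddf = dd.from_pandas(pdf, npartitions=4)

print(ddf["x"].mean().compute())  # .compute() triggers the parallel run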


2 Answers

json_normalize takes already-parsed JSON (a dict, or a list/Series of dicts), not raw JSON strings, so decode the column with json.loads first (in pandas 1.0+ the function lives at pd.json_normalize):

pd.io.json.json_normalize(df.data.apply(json.loads)) 

setup

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])
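To see why this suits the question's data, here is a hypothetical pair of records with differing keys and nesting; json_normalize flattens nested attributes into dotted column names and fills missing keys with NaN:

import json
import pandas as pd

# Hypothetical strings standing in for the 'data' column.
raw = ['{"a": 1, "b": {"c": 2}}', '{"a": 3, "d": 4}']

flat = pd.json_normalize([json.loads(x) for x in raw])
print(flat)  # columns a, b.c, d; keys absent from a record become NaN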
answered by piRSquared


I think you can first convert the string column data to dicts with json.loads, then build a list of the values, and finally use DataFrame.from_records:

import pandas as pd
import json

df = pd.read_csv('http://pastebin.com/raw/7L86m9R2',
                 header=None, index_col=0, names=['data'])

a = df.data.apply(json.loads).values.tolist()

print(pd.DataFrame.from_records(a))
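One caveat: unlike json_normalize, DataFrame.from_records does not flatten nested objects, so a nested dict stays as a single cell value. A tiny illustration with made-up records:

import pandas as pd

# Made-up records: 'b' is nested in the first and missing from the second.
records = [{"a": 1, "b": {"c": 2}}, {"a": 3}]
print(pd.DataFrame.from_records(records))
# 'b' holds the dict {'c': 2} in row 0 and NaN in row 1; it is not expanded to 'b.c'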

Another idea:

df = pd.json_normalize(df['data'].apply(json.loads))
answered by jezrael