
Split / Explode a column of dictionaries into separate columns with pandas

Tags: python, json, dictionary, pandas, json-normalize

I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a pandas DataFrame. However, the last column of this DataFrame holds a dictionary of values in each row. The DataFrame df looks like this:

    Station ID     Pollutant Levels
    8809           {"a": "46", "b": "3", "c": "12"}
    8810           {"a": "36", "b": "5", "c": "8"}
    8811           {"b": "2", "c": "7"}
    8812           {"c": "11"}
    8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame df2 looks like this:

    Station ID     a      b       c
    8809           46     3       12
    8810           36     5       8
    8811           NaN    2       7
    8812           NaN    NaN     11
    8813           82     NaN     15

The major issue I'm having is that the dictionaries are not all the same length. However, they only ever contain up to the same 3 keys: 'a', 'b', and 'c', and these always appear in the same order ('a' first, 'b' second, 'c' third).

The following code USED to work and return exactly what I wanted (df2).

    objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
    df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
    print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end)

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my data because each value is a single Unicode string rather than a dictionary, i.e.:

    # My data format
    u{'a': '1', 'b': '2', 'c': '3'}

    # and not
    {u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode string into a dictionary?

554

asked Jul 06 '16 18:07

llaffin

2 Answers

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.
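Because eval runs whatever text it is handed, a more cautious option for values coming out of a database is ast.literal_eval, or json.loads when the strings are valid JSON (the double-quoted sample in the question looks like it is). Here is a minimal sketch, assuming the column holds strings shaped like the ones in the question; the small sample frame is made up for illustration:

```python
import ast
import json

import pandas as pd

# Hypothetical sample mirroring the question: each cell is a *string*, not a dict.
df = pd.DataFrame({
    "Station ID": [8809, 8811],
    "Pollutant Levels": [
        '{"a": "46", "b": "3", "c": "12"}',
        '{"b": "2", "c": "7"}',
    ],
})

# literal_eval only accepts Python literals, so it cannot execute arbitrary code.
parsed_literal = df["Pollutant Levels"].map(ast.literal_eval)

# json.loads also works when the text is valid JSON (double-quoted keys and values).
parsed_json = df["Pollutant Levels"].map(json.loads)

print(parsed_literal.iloc[0])
```

Either result can be assigned back to df['Pollutant Levels'] before using the column-splitting approaches below.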

Using a small example, you can use .apply(pd.Series):

    In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

    In [3]: df
    Out[3]:
       a                   b
    0  1           {u'c': 1}
    1  2           {u'd': 3}
    2  3  {u'c': 5, u'd': 6}

    In [4]: df['b'].apply(pd.Series)
    Out[4]:
         c    d
    0  1.0  NaN
    1  NaN  3.0
    2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

    In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
    Out[7]:
       a    c    d
    0  1  1.0  NaN
    1  2  NaN  3.0
    2  3  5.0  6.0
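Applied to the data from the question, the same idea might look like the sketch below. It assumes the column already contains real dicts (not strings), and since the sample values are strings such as "46", it also converts them with pd.to_numeric so the result holds numbers. The frame itself is rebuilt by hand from the question's sample:

```python
import pandas as pd

# Hypothetical frame rebuilt from the question's sample; values are strings.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutant Levels": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# Expand each dict into its own set of columns; missing keys become NaN.
levels = df["Pollutant Levels"].apply(pd.Series)

# The values arrive as strings, so convert every column to numbers.
levels = levels.apply(pd.to_numeric)

# Put the new columns next to the remaining original columns.
df2 = pd.concat([df.drop("Pollutant Levels", axis=1), levels], axis=1)
print(df2)
```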

Using your code, this also works if I leave out the iloc part:

    In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
    Out[15]:
       a    c    d
    0  1  1.0  NaN
    1  2  NaN  3.0
    2  3  5.0  6.0
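One thing to watch with the tolist() variant: pd.concat(..., axis=1) aligns rows on the index, and a frame built from tolist() gets a fresh 0..n-1 index. If df was filtered or otherwise has a non-default index, passing index=df.index keeps the rows lined up. The sketch below assumes a made-up frame with such an index, and uses reindex to pin the columns to 'a', 'b', 'c', which seems to be what the iloc[:, :3] in the question was aiming for:

```python
import pandas as pd

# Hypothetical frame whose index is NOT the default 0..n-1 (e.g. after filtering).
df = pd.DataFrame(
    {
        "Station ID": [8811, 8812, 8813],
        "Pollutant Levels": [
            {"b": "2", "c": "7"},
            {"c": "11"},
            {"a": "82", "c": "15"},
        ],
    },
    index=[10, 20, 30],
)

# Reuse the original index so that concat lines the rows up correctly.
levels = pd.DataFrame(df["Pollutant Levels"].tolist(), index=df.index)

# Fix the column set and order explicitly instead of slicing by position.
levels = levels.reindex(columns=["a", "b", "c"])

df2 = pd.concat([df.drop("Pollutant Levels", axis=1), levels], axis=1)
print(df2)
```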

106

answered Sep 29 '22 11:09

joris

I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

    import pandas as pd

    df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...
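To get the layout asked for in the question, the normalized columns still need to be joined back to Station ID. A rough sketch of how that could look, assuming pandas 1.0 or newer (where json_normalize is available as pd.json_normalize) and a column of real dicts; the frame is rebuilt by hand from the question's sample:

```python
import pandas as pd

# Hypothetical frame rebuilt from the question's sample.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutant Levels": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# Pass a plain list of dicts so the call does not depend on how a given
# pandas version treats a Series argument.
levels = pd.json_normalize(df["Pollutant Levels"].tolist())

# The result is numbered 0..n-1, so copy the original index over before joining.
levels.index = df.index

df2 = df.drop(columns="Pollutant Levels").join(levels)
print(df2)
```

The values stay as strings here; applying pd.to_numeric to the new columns would turn them into numbers if needed.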

answered Sep 29 '22 12:09

Lech Birek
