
Split / Explode a column of dictionaries into separate columns with pandas

I have data saved in a PostgreSQL database. I am querying this data using Python 2.7 and turning it into a pandas DataFrame. However, the last column of this DataFrame contains a dictionary of values. The DataFrame df looks like this:

    Station ID     Pollutants
    8809           {"a": "46", "b": "3", "c": "12"}
    8810           {"a": "36", "b": "5", "c": "8"}
    8811           {"b": "2", "c": "7"}
    8812           {"c": "11"}
    8813           {"a": "82", "c": "15"}

I need to split this column into separate columns, so that the DataFrame df2 looks like this:

    Station ID     a      b       c
    8809           46     3       12
    8810           36     5       8
    8811           NaN    2       7
    8812           NaN    NaN     11
    8813           82     NaN     15

The major issue I'm having is that the lists are not the same lengths. But all of the lists only contain up to the same 3 values: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).

The following code USED to work and return exactly what I wanted (df2).

    objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
    df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
    print(df2)

I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:

IndexError: out-of-bounds on slice (end)  

I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.

Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!

EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:

    # My data format
    u{'a': '1', 'b': '2', 'c': '3'}

    # and not
    {u'a': '1', u'b': '2', u'c': '3'}

The data is imported from the PostgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode string?

llaffin asked Jul 06 '16 18:07



2 Answers

To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.
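One caveat on eval: it executes whatever is in the string, so for values coming out of a database a literal parser is usually the safer choice. Here is a minimal sketch, assuming the column holds strings like the ones in the question. The sample frame is made up to match the question's layout, and the helper name parse_pollutants is hypothetical:

```python
import ast

import pandas as pd

# Made-up sample matching the question's layout; the values are strings, not dicts.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811],
    "Pollutant Levels": [
        '{"a": "46", "b": "3", "c": "12"}',
        '{"a": "36", "b": "5", "c": "8"}',
        '{"b": "2", "c": "7"}',
    ],
})


def parse_pollutants(value):
    """Hypothetical helper: turn a dict-like string into a real dict."""
    # Values that are already dicts pass through unchanged.
    if isinstance(value, dict):
        return value
    # literal_eval only accepts Python literals, so it cannot run arbitrary code.
    return ast.literal_eval(value)


df["Pollutant Levels"] = df["Pollutant Levels"].map(parse_pollutants)
print(type(df["Pollutant Levels"].iloc[0]))
```

If the strings are strict JSON (double-quoted keys and values, as in the question's table), json.loads should do the same job.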


Using a small example, you can use .apply(pd.Series):

    In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})

    In [3]: df
    Out[3]:
       a                   b
    0  1           {u'c': 1}
    1  2           {u'd': 3}
    2  3  {u'c': 5, u'd': 6}

    In [4]: df['b'].apply(pd.Series)
    Out[4]:
         c    d
    0  1.0  NaN
    1  NaN  3.0
    2  5.0  6.0

To combine it with the rest of the dataframe, you can concat the other columns with the above result:

    In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
    Out[7]:
       a    c    d
    0  1  1.0  NaN
    1  2  NaN  3.0
    2  3  5.0  6.0

Using your code, this also works if I leave out the iloc part:

    In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
    Out[15]:
       a    c    d
    0  1  1.0  NaN
    1  2  NaN  3.0
    2  3  5.0  6.0
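Applied to the data from the question, the same tolist() approach could look roughly like this. It is only a sketch: it assumes the column already holds real dicts (not strings), and the conversion to numbers with pd.to_numeric is my addition, since the values in the question are stored as strings:

```python
import pandas as pd

# Data from the question, with the column already holding real dicts.
df = pd.DataFrame({
    "Station ID": [8809, 8810, 8811, 8812, 8813],
    "Pollutant Levels": [
        {"a": "46", "b": "3", "c": "12"},
        {"a": "36", "b": "5", "c": "8"},
        {"b": "2", "c": "7"},
        {"c": "11"},
        {"a": "82", "c": "15"},
    ],
})

# Build the new columns from the list of dicts; missing keys become NaN.
levels = pd.DataFrame(df["Pollutant Levels"].tolist(), index=df.index)

# The values are strings, so convert each new column to numbers.
levels = levels.apply(pd.to_numeric)

# Drop the dict column and attach the expanded columns side by side.
df2 = pd.concat([df.drop("Pollutant Levels", axis=1), levels], axis=1)
print(df2)
```

Passing index=df.index keeps the rows aligned in the concat even if df does not have the default 0..n-1 index.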
joris answered Sep 29 '22 11:09


I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:

    import pandas as pd

    df2 = pd.json_normalize(df['Pollutant Levels'])

This avoids costly apply functions...
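Note that json_normalize returns only the expanded columns, so you still need to attach them to the rest of the frame. A small sketch of one way to do that, assuming pandas 1.0 or later (where pd.json_normalize is available at the top level) and a column that already holds dicts. The sample data is a subset of the question's:

```python
import pandas as pd

df = pd.DataFrame({
    "Station ID": [8809, 8811, 8812],
    "Pollutant Levels": [
        {"a": "46", "b": "3", "c": "12"},
        {"b": "2", "c": "7"},
        {"c": "11"},
    ],
})

# Pass a plain list of dicts; the result comes back with a fresh 0..n-1 index.
levels = pd.json_normalize(df["Pollutant Levels"].tolist())

# Reuse the original index so the join lines up row by row.
levels.index = df.index

df2 = df.drop(columns="Pollutant Levels").join(levels)
print(df2)
```

The values stay as strings here; something like pd.to_numeric can be applied afterwards if numbers are needed.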

Lech Birek answered Sep 29 '22 12:09