
Extract dictionary value from column in data frame

Tags: python, pandas

I'm looking for a way to optimize my code.

My input data looks like this:

import pandas as pn

a=[{'Feature1': 'aa1','Feature2': 'bb1','Feature3': 'cc2' },
 {'Feature1': 'aa2','Feature2': 'bb2' },
 {'Feature1': 'aa1','Feature2': 'cc1' }
 ]
b=['num1','num2','num3']


df= pn.DataFrame({'num':b, 'dic':a })

I would like to extract the element 'Feature3' from the dictionaries in column 'dic' (if it exists) in the data frame above. So far I have been able to solve it, but I don't know if this is the fastest way, and it seems a little overcomplicated.

Feature3 = []
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for idx, row in df['dic'].items():
    l = row.keys()

    if 'Feature3' in l:
        Feature3.append(row['Feature3'])
    else:
        Feature3.append(None)

df['Feature3'] = Feature3
print(df)

Is there a better/faster/simpler way to extract this Feature3 into a separate column in the dataframe?

Thank you in advance for help.

michalk asked Feb 29 '16 22:02



5 Answers

df['Feature3'] = df['dic'].apply(lambda x: x.get('Feature3'))

Agree with maxymoo. Consider changing the format of your dataframe.

(Sidenote: pandas is generally imported as pd)

as133 answered Oct 11 '22 01:10


You can use a list comprehension to extract 'Feature3' from each row in your dataframe, returning a list.

feature3 = [d.get('Feature3') for d in df.dic]

If 'Feature3' is not a key in a given dictionary, get() returns None by default.

You don't even need pandas, as you can again use a list comprehension to extract the feature from your original dictionary a.

feature3 = [d.get('Feature3') for d in a]
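
To tie this back to the question, here is a minimal sketch that assigns the list comprehension result straight to a new column. It rebuilds the question's sample data and imports pandas as pd rather than pn; that alias is a choice made for this example, not part of the original code.

```python
import pandas as pd

# Sample data copied from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# dict.get returns None for a missing key, so no explicit membership test is needed
df['Feature3'] = [d.get('Feature3') for d in df['dic']]
print(df[['num', 'Feature3']])
```

Only the first row has a 'Feature3' key, so the other two rows end up with a missing value in the new column.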

Alexander answered Oct 11 '22 00:10


If you apply the Series constructor to the column, each dictionary is expanded into its own row and you get quite a nice DataFrame:

>>> df.dic.apply(pn.Series)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

From this point, you can just use regular pandas operations.
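
As one possible "regular pandas operation", the expanded columns can be joined back onto the remaining columns of the original frame. This is only a sketch on the question's sample data; the names expanded and result are made up for the example, and pandas is imported as pd here.

```python
import pandas as pd

# Sample data copied from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']
df = pd.DataFrame({'num': b, 'dic': a})

# Expand each dictionary into its own row of columns
expanded = df['dic'].apply(pd.Series)

# Drop the raw dictionary column and attach the expanded columns by index
result = df.drop(columns='dic').join(expanded)
print(result)
```

The join aligns on the index, so it relies on expanded keeping the same index as df, which apply does.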

Ami Tavory answered Oct 11 '22 01:10


I think you can first create a new DataFrame with a list comprehension and then add the new column from it, like this:

df1 = pd.DataFrame([x for x in df['dic']])
print(df1)
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

df['Feature3'] = df1['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Or one line:

df['Feature3'] = pd.DataFrame([x for x in df['dic']])['Feature3']
print(df)
                                                 dic   num Feature3
0  {u'Feature2': u'bb1', u'Feature3': u'cc2', u'F...  num1      cc2
1         {u'Feature2': u'bb2', u'Feature1': u'aa2'}  num2      NaN
2         {u'Feature2': u'cc1', u'Feature1': u'aa1'}  num3      NaN

Timings:

len(df) = 3:

In [24]: %timeit pd.DataFrame([x for x in df['dic']])
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 596 µs per loop

In [25]: %timeit df.dic.apply(pn.Series)
1000 loops, best of 3: 1.43 ms per loop

len(df) = 3000:

In [27]: %timeit pd.DataFrame([x for x in df['dic']])
100 loops, best of 3: 3.16 ms per loop

In [28]: %timeit df.dic.apply(pn.Series)
1 loops, best of 3: 748 ms per loop
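
For anyone who wants to repeat the comparison outside IPython, a rough sketch using the standard library's timeit module might look like the following. The row count (300) and repetition count (5) are arbitrary choices for this example, the helper names via_constructor and via_apply are made up, and absolute numbers will vary with machine and pandas version.

```python
import timeit

import pandas as pd

# Repeat the question's three dictionaries to get a larger column (300 rows)
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
] * 100
df = pd.DataFrame({'dic': a})


def via_constructor():
    # Build the frame in one go from the list of dictionaries
    return pd.DataFrame(list(df['dic']))


def via_apply():
    # Build one Series per row, then combine them
    return df['dic'].apply(pd.Series)


t_constructor = timeit.timeit(via_constructor, number=5)
t_apply = timeit.timeit(via_apply, number=5)
print('constructor: %.4f s, apply: %.4f s' % (t_constructor, t_apply))
```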

jezrael answered Oct 11 '22 00:10


I think you're thinking about the data structures slightly wrong. It's better to create the data frame with the features as columns from the start; pandas is actually smart enough to do this by default:

In [240]: pd.DataFrame(a)
Out[240]:
  Feature1 Feature2 Feature3
0      aa1      bb1      cc2
1      aa2      bb2      NaN
2      aa1      cc1      NaN

You would then add on your "num" column in a separate step, since the data is in a different orientation, either with

df['num'] = b

or

df = df.assign(num = b)

(I prefer the second option since it's got a more functional flavour).
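
Put together, the two steps can be sketched end to end on the question's sample data as follows:

```python
import pandas as pd

# Sample data copied from the question
a = [
    {'Feature1': 'aa1', 'Feature2': 'bb1', 'Feature3': 'cc2'},
    {'Feature1': 'aa2', 'Feature2': 'bb2'},
    {'Feature1': 'aa1', 'Feature2': 'cc1'},
]
b = ['num1', 'num2', 'num3']

# The list of dictionaries becomes one column per key;
# assign returns a new frame with the extra "num" column appended
df = pd.DataFrame(a).assign(num=b)
print(df)
```

Because assign returns a new DataFrame instead of modifying one in place, it can be chained directly onto the constructor call.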

maxymoo answered Oct 11 '22 02:10