Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

make pandas DataFrame to a dict and dropna

Tags:

python

pandas

I have some pandas DataFrame with NaNs in it. Like this:

import pandas as pd import numpy as np raw_data={'A':{1:2,2:3,3:4},'B':{1:np.nan,2:44,3:np.nan}} data=pd.DataFrame(raw_data) >>> data    A   B 1  2 NaN 2  3  44 3  4 NaN 

Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:

{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}} 

But using pandas to_dict function gives me a result like this:

>>> data.to_dict() {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}  

So how to make a dict out of the DataFrame and get rid of the NaNs ?

like image 506
der_die_das_jojo Avatar asked Sep 25 '14 07:09

der_die_das_jojo


People also ask

How do I convert a Pandas DataFrame to a dictionary?

To convert pandas DataFrame to Dictionary object, use to_dict() method, this takes orient as dict by default which returns the DataFrame in format {column -> {index -> value}} . When no orient is specified, to_dict() returns in this format.

Can you store a Pandas DataFrame in a dictionary?

Overview: A pandas DataFrame can be converted into a Python dictionary using the DataFrame instance method to_dict(). The output can be specified of various orientations using the parameter orient.

How do I merge two Dataframes in Pandas?

The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other. The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible.

What does DF Dropna () do?

Definition and Usage. The dropna() method removes the rows that contains NULL values. The dropna() method returns a new DataFrame object unless the inplace parameter is set to True , in that case the dropna() method does the removing in the original DataFrame instead.


2 Answers

There are many ways you could accomplish this, I spent some time evaluating performance on a not-so-large (70k) dataframe. Although @der_die_das_jojo's answer is functional, it's also pretty slow.

The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.

On my test dataframe (df):

Above method:

%time [ v.dropna().to_dict() for k,v in df.iterrows() ] CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s Wall time: 50.9 s 

Another slow method:

%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='rows') CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s Wall time: 1min 8s 

Fastest method I could find:

%time [ {k:v for k,v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='rows')] CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s Wall time: 14.7 s 

The format of this output is a row-oriented dictionary, you may need to make adjustments if you want the column-oriented form in the question.

Very interested if anyone finds an even faster answer to this question.

like image 98
Peter Mularien Avatar answered Sep 29 '22 16:09

Peter Mularien


First graph generate dictionaries per columns, so output is few very long dictionaries, number of dicts depends of number of columns.

I test multiple methods with perfplot and fastest method is loop by each column and remove missing values or Nones by Series.dropna or with Series.notna in boolean indexing in larger DataFrames.

Is smaller DataFrames is fastest dictionary comprehension with testing missing values by NaN != NaN trick and also testing Nones.

graph

np.random.seed(2020) import perfplot  def comp_notnull(df1):     return {k1: {k:v for k,v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}  def comp_NaNnotNaN_None(df1):     return {k1: {k:v for k,v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}  def comp_dropna(df1):     return {k: v.dropna().to_dict() for k,v in df1.items()}  def comp_bool_indexing(df1):     return {k: v[v.notna()].to_dict() for k,v in df1.items()}  def make_df(n):     df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))     return df1  perfplot.show(     setup=make_df,     kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],     n_range=[10**k for k in range(1, 7)],     logx=True,     logy=True,     equality_check=False,     xlabel='len(df)') 

Another situtation is if generate dictionaries per rows - get list of huge amount of small dictionaries, then fastest is list comprehension with filtering NaNs and Nones:

g1

np.random.seed(2020) import perfplot   def comp_notnull1(df1):     return [{k:v for k,v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='r')]  def comp_NaNnotNaN_None1(df1):     return [{k:v for k,v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='r')]  def comp_dropna1(df1):     return [v.dropna().to_dict() for k,v in df1.T.items()]  def comp_bool_indexing1(df1):     return [v[v.notna()].to_dict() for k,v in df1.T.items()]   def make_df(n):     df1 = pd.DataFrame(np.random.choice([1,2, np.nan], size=(n, 5)), columns=list('ABCDE'))     return df1  perfplot.show(     setup=make_df,     kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],     n_range=[10**k for k in range(1, 7)],     logx=True,     logy=True,     equality_check=False,     xlabel='len(df)') 
like image 36
jezrael Avatar answered Sep 29 '22 16:09

jezrael