I have some pandas DataFrame with NaNs in it. Like this:
import pandas as pd
import numpy as np

raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)

>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN
Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
But using pandas to_dict function gives me a result like this:
>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}
So how do I make a dict out of the DataFrame and get rid of the NaNs?
To convert a pandas DataFrame to a dictionary, use the to_dict() instance method. Its orient parameter defaults to 'dict', which returns the DataFrame in the format {column -> {index -> value}}; other orientations can be selected by passing a different orient.
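For instance, with the question's data DataFrame, the default orientation versus the row-oriented 'records' orientation looks roughly like this (a minimal sketch; exact numeric types may vary by pandas version):

# Default orient='dict': {column -> {index -> value}}
data.to_dict()
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}

# orient='records': a list with one dict per row (the index is dropped)
data.to_dict(orient='records')
# [{'A': 2, 'B': nan}, {'A': 3, 'B': 44.0}, {'A': 4, 'B': nan}]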
Definition and Usage. The dropna() method removes the rows that contain NULL values. It returns a new DataFrame object unless the inplace parameter is set to True, in which case dropna() removes the rows from the original DataFrame instead.
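As a minimal sketch using the question's data: DataFrame.dropna() works row-wise by default, while Series.dropna() applied per column keeps only the non-missing entries, which is what the desired dict needs:

# Row-wise dropna drops any row containing a NaN (rows 1 and 3 here)
data.dropna()
#    A     B
# 2  3  44.0

# Per-column Series.dropna keeps only the non-missing entries
{col: s.dropna().to_dict() for col, s in data.items()}
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}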
There are many ways you could accomplish this; I spent some time evaluating performance on a not-so-large (70k row) dataframe. Although @der_die_das_jojo's answer is functional, it's also pretty slow.
The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.
On my test dataframe (df):
The method above:
%time [v.dropna().to_dict() for k, v in df.iterrows()]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s
Another slow method:
%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='rows')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s
Fastest method I could find:
%time [{k: v for k, v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s
The format of this output is a row-oriented dictionary; you may need to make adjustments if you want the column-oriented form in the question, as sketched below.
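A minimal sketch of the column-oriented variant of the same filtering idea, applied to each column's inner {index -> value} dict (using the question's data DataFrame):

# Column-oriented: filter the NaNs inside each column's {index -> value} dict
{col: {k: v for k, v in inner.items() if pd.notnull(v)}
 for col, inner in data.to_dict().items()}
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}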
I'd be very interested if anyone finds an even faster answer to this question.
The first graph generates dictionaries per column, so the output is a few very long dictionaries, and the number of dicts depends on the number of columns. I tested multiple methods with perfplot: for larger DataFrames, the fastest method is to loop over each column and remove missing values or Nones with Series.dropna, or with Series.notna in boolean indexing. For smaller DataFrames, the fastest is a dictionary comprehension that tests for missing values with the NaN != NaN trick and also tests for Nones.
np.random.seed(2020)
import perfplot

def comp_notnull(df1):
    return {k1: {k: v for k, v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    return {k1: {k: v for k, v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k, v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k, v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True, logy=True,
    equality_check=False,
    xlabel='len(df)')
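As a quick usage check (not part of the benchmark), either column-wise kernel applied to the question's data DataFrame gives the output asked for:

comp_dropna(data)
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

comp_bool_indexing(data)
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}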
Another situation is generating dictionaries per row: you get a list of a huge number of small dictionaries, and then the fastest method is a list comprehension that filters out NaNs and Nones:
np.random.seed(2020)
import perfplot

def comp_notnull1(df1):
    return [{k: v for k, v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='records')]

def comp_NaNnotNaN_None1(df1):
    return [{k: v for k, v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='records')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k, v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k, v in df1.T.items()]

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True, logy=True,
    equality_check=False,
    xlabel='len(df)')
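And a quick usage check of the row-oriented form on the question's data (note that the records orientation drops the index, and exact numeric types may vary by pandas version):

comp_notnull1(data)
# [{'A': 2}, {'A': 3, 'B': 44.0}, {'A': 4}]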