I have some pandas DataFrame with NaNs in it. Like this:
import pandas as pd
import numpy as np

raw_data = {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: np.nan, 2: 44, 3: np.nan}}
data = pd.DataFrame(raw_data)

>>> data
   A   B
1  2 NaN
2  3  44
3  4 NaN
Now I want to make a dict out of it and at the same time remove the NaNs. The result should look like this:
{'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}
But using pandas to_dict function gives me a result like this:
>>> data.to_dict()
{'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}
So how do I make a dict out of the DataFrame and get rid of the NaNs?
To convert a pandas DataFrame to a dictionary, use the to_dict() instance method. Its orient parameter defaults to 'dict', which returns the DataFrame in the format {column -> {index -> value}}; other orientations can be selected by passing a different orient.
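For instance, with the question's data DataFrame, the default orientation versus the row-oriented 'records' orientation looks roughly like this (a minimal sketch; exact numeric types may vary by pandas version):

# Default orient='dict': {column -> {index -> value}}
data.to_dict()
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {1: nan, 2: 44.0, 3: nan}}

# orient='records': a list with one dict per row (the index is dropped)
data.to_dict(orient='records')
# [{'A': 2, 'B': nan}, {'A': 3, 'B': 44.0}, {'A': 4, 'B': nan}]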
Definition and Usage. The dropna() method removes the rows that contain NULL values. It returns a new DataFrame object unless the inplace parameter is set to True, in which case dropna() removes the rows from the original DataFrame instead.
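As a minimal sketch using the question's data: DataFrame.dropna() works row-wise by default, while Series.dropna() applied per column keeps only the non-missing entries, which is what the desired dict needs:

# Row-wise dropna drops any row containing a NaN (rows 1 and 3 here)
data.dropna()
#    A     B
# 2  3  44.0

# Per-column Series.dropna keeps only the non-missing entries
{col: s.dropna().to_dict() for col, s in data.items()}
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}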
There are many ways you could accomplish this; I spent some time evaluating performance on a not-so-large (70k row) dataframe. Although @der_die_das_jojo's answer is functional, it's also pretty slow.
The answer suggested by this question actually turns out to be about 5x faster on a large dataframe.
On my test dataframe (df):
The method above:
%time [v.dropna().to_dict() for k, v in df.iterrows()]
CPU times: user 51.2 s, sys: 0 ns, total: 51.2 s
Wall time: 50.9 s
Another slow method:
%time df.apply(lambda x: [x.dropna()], axis=1).to_dict(orient='rows')
CPU times: user 1min 8s, sys: 880 ms, total: 1min 8s
Wall time: 1min 8s
Fastest method I could find:
%time [{k: v for k, v in m.items() if pd.notnull(v)} for m in df.to_dict(orient='records')]
CPU times: user 14.5 s, sys: 176 ms, total: 14.7 s
Wall time: 14.7 s
The format of this output is a row-oriented dictionary; you may need to make adjustments if you want the column-oriented form in the question, as sketched below.
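A minimal sketch of the column-oriented variant of the same filtering idea, applied to each column's inner {index -> value} dict (using the question's data DataFrame):

# Column-oriented: filter the NaNs inside each column's {index -> value} dict
{col: {k: v for k, v in inner.items() if pd.notnull(v)}
 for col, inner in data.to_dict().items()}
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}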
I'd be very interested if anyone finds an even faster answer to this question.
The first graph generates dictionaries per column, so the output is a few very long dictionaries, and the number of dicts depends on the number of columns. I tested multiple methods with perfplot: for larger DataFrames, the fastest method is to loop over each column and remove missing values or Nones with Series.dropna, or with Series.notna in boolean indexing. For smaller DataFrames, the fastest is a dictionary comprehension that tests for missing values with the NaN != NaN trick and also tests for Nones.
np.random.seed(2020)
import perfplot

def comp_notnull(df1):
    return {k1: {k: v for k, v in v1.items() if pd.notnull(v)} for k1, v1 in df1.to_dict().items()}

def comp_NaNnotNaN_None(df1):
    return {k1: {k: v for k, v in v1.items() if v == v and v is not None} for k1, v1 in df1.to_dict().items()}

def comp_dropna(df1):
    return {k: v.dropna().to_dict() for k, v in df1.items()}

def comp_bool_indexing(df1):
    return {k: v[v.notna()].to_dict() for k, v in df1.items()}

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna, comp_bool_indexing, comp_notnull, comp_NaNnotNaN_None],
    n_range=[10**k for k in range(1, 7)],
    logx=True, logy=True,
    equality_check=False,
    xlabel='len(df)')
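As a quick usage check (not part of the benchmark), either column-wise kernel applied to the question's data DataFrame gives the output asked for:

comp_dropna(data)
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}

comp_bool_indexing(data)
# {'A': {1: 2, 2: 3, 3: 4}, 'B': {2: 44.0}}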
Another situation is generating dictionaries per row: you get a list of a huge number of small dictionaries, and then the fastest method is a list comprehension that filters out NaNs and Nones:
np.random.seed(2020)
import perfplot

def comp_notnull1(df1):
    return [{k: v for k, v in m.items() if pd.notnull(v)} for m in df1.to_dict(orient='records')]

def comp_NaNnotNaN_None1(df1):
    return [{k: v for k, v in m.items() if v == v and v is not None} for m in df1.to_dict(orient='records')]

def comp_dropna1(df1):
    return [v.dropna().to_dict() for k, v in df1.T.items()]

def comp_bool_indexing1(df1):
    return [v[v.notna()].to_dict() for k, v in df1.T.items()]

def make_df(n):
    df1 = pd.DataFrame(np.random.choice([1, 2, np.nan], size=(n, 5)), columns=list('ABCDE'))
    return df1

perfplot.show(
    setup=make_df,
    kernels=[comp_dropna1, comp_bool_indexing1, comp_notnull1, comp_NaNnotNaN_None1],
    n_range=[10**k for k in range(1, 7)],
    logx=True, logy=True,
    equality_check=False,
    xlabel='len(df)')
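And a quick usage check of the row-oriented form on the question's data (note that the records orientation drops the index, and exact numeric types may vary by pandas version):

comp_notnull1(data)
# [{'A': 2}, {'A': 3, 'B': 44.0}, {'A': 4}]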