 

Remove blank key entries in Pandas pd.to_dict

Pandas has a very nice feature for exporting a dataframe to a list of dicts via DataFrame.to_dict('records').

For example:

import pandas as pd

d = pd.DataFrame({'a':[1,2,3], 'b':['a', 'b', None]})
d.to_dict('records')

returns

[{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3, 'b': None}]

For my use case, I would prefer the following result:

[{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]

where the key b is removed from the third entry. This is the default behavior in R when using jsonlite, and I am wondering how to remove keys with missing values from each entry.

Btibert3 asked Jan 24 '23 23:01


2 Answers

Update: a list comprehension over itertuples with a nested dict comprehension. It is the fastest of the approaches timed here:

l = [{k: v for k, v in tup._asdict().items() if v is not None}
     for tup in d.itertuples(index=False)]

Out[74]: [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]
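Note that the `v is not None` test only removes None. A missing value stored as NaN (for example in a float column) would be kept, and itertuples renames columns whose labels are not valid Python identifiers. A minimal sketch of a variant that avoids both, assuming every cell holds a scalar; the helper name `records_without_missing` and the extra column `c` are made up for illustration:

```python
import pandas as pd


def records_without_missing(df):
    """Return one dict per row, omitting keys whose value is None/NaN/NaT.

    Hypothetical helper: assumes scalar cell values, because pd.notna
    returns an array (not a bool) for list-like cells.
    """
    return [
        {k: v for k, v in record.items() if pd.notna(v)}
        for record in df.to_dict("records")
    ]


# Same frame as in the question, plus an illustrative float column with a NaN.
d = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["a", "b", None],
    "c": [0.5, float("nan"), 1.5],
})
result = records_without_missing(d)
print(result)
```

This goes through to_dict('records') first, so it keeps the original column labels; I have not timed it against the itertuples version.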

Timing:

d1 = pd.concat([d]*5000, ignore_index=True)

In [76]: %timeit [{k: v for k, v in tup._asdict().items() if v is not None} for
    ...:  tup in d1.itertuples(index=False)]
442 ms ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Another way is a list comprehension over iterrows, dropping the missing values from each row:

l = [row.dropna().to_dict() for k, row in d.iterrows()]

Out[33]: [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]

iterrows has a reputation for slow performance, so I tested it on a sample of 15000 rows to compare against the stack approach:

In [49]: d1 = pd.concat([d]*5000, ignore_index=True)

In [50]: %timeit d1.stack().groupby(level=0).agg(lambda x : x.reset_index(level
    ...: =0,drop=True).to_dict()).tolist()
7.52 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [51]: %timeit [row.dropna().to_dict() for k, row in d1.iterrows()]
6.45 s ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Interesting result: on this sample iterrows is slightly faster than stack. However, I suspect that on bigger data it will end up slower than stack.
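To repeat this kind of comparison outside IPython, here is a rough sketch using the standard-library timeit module. The function names `via_itertuples` and `via_iterrows`, the smaller sample size, and the explicit object dtype are my own choices for illustration; absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd


def via_itertuples(df):
    # Nested dict comprehension over namedtuples; drops only None values.
    return [{k: v for k, v in tup._asdict().items() if v is not None}
            for tup in df.itertuples(index=False)]


def via_iterrows(df):
    # Builds a Series per row and drops its missing values.
    return [row.dropna().to_dict() for _, row in df.iterrows()]


# Assumption: force object dtype so the missing value stays None,
# which is what the `is not None` test relies on.
d = pd.DataFrame({"a": [1, 2, 3],
                  "b": pd.Series(["a", "b", None], dtype=object)})
d_big = pd.concat([d] * 200, ignore_index=True)  # 600 rows, kept small on purpose

# Both approaches should agree before their speed is compared.
assert via_itertuples(d_big) == via_iterrows(d_big)

for fn in (via_itertuples, via_iterrows):
    seconds = min(timeit.repeat(lambda: fn(d_big), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f} s")
```

Raising the multiplier in `pd.concat` is a simple way to check how each approach scales with the number of rows.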

Andy L. answered Feb 13 '23 01:02


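We can use stack, which (by default in the pandas versions current when this was written) drops the missing values, and then rebuild one dict per row: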

l=d.stack().groupby(level=0).agg(lambda x : x.reset_index(level=0,drop=True).to_dict()).tolist()
Out[142]: [{'a': 1, 'b': 'a'}, {'a': 2, 'b': 'b'}, {'a': 3}]

BENY answered Feb 13 '23 03:02