
Pandas: run length of NaN holes

Tags:

python

pandas

I have hundreds of time series objects, each with hundreds of thousands of entries. Some percentage of the data entries are missing (NaN). It matters to my application whether those are single, scattered NaNs or long sequences of NaNs.

Therefore I would like a function that gives me the run length of each contiguous sequence of NaNs. I can do

myseries.isnull()

to get a series of bool. And I can take a moving median or moving average to get an idea of the size of the data holes. However, it would be nice if there were an efficient way of getting a list of hole lengths for a series.

I.e., it would be nice to have a myfunc so that

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
myfunc(a.isnull())
==> Series([1, 3, 2])

(because there are 1, 3 and 2 NaNs, respectively)

From that, I can make histograms of hole lengths, and of the logical AND or OR of isnull across multiple series (that might be substitutes for each other), and other nice things.

I would also like to get ideas of other ways to quantify the "clumpiness" of the data holes.

Bjarke Ebert asked May 31 '13 12:05




1 Answer

import pandas as pd
import numpy as np
import itertools

a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
# Group consecutive values by whether they are NaN (k is True for NaN runs),
# then keep the length of each NaN run.
len_holes = [len(list(g)) for k, g in itertools.groupby(a, np.isnan) if k]
print(len_holes)

results in

[1, 3, 2]
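Since the question mentions series with hundreds of thousands of entries, a vectorized variant that stays inside pandas may be worth trying. The following is only a sketch, not part of the original answer: the helper name nan_run_lengths is hypothetical, and it assumes that labelling runs with shift/cumsum and summing the boolean mask per run is acceptable for your data.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(s):
    """Return the length of each contiguous NaN run in s (hypothetical helper)."""
    is_nan = s.isnull()
    # A new run starts wherever the NaN / non-NaN state differs from the previous entry.
    run_id = is_nan.ne(is_nan.shift()).cumsum()
    # Summing the boolean mask per run gives the run length for NaN runs and 0 otherwise.
    lengths = is_nan.groupby(run_id).sum()
    return lengths[lengths > 0].astype(int).reset_index(drop=True)


a = pd.Series([1, 2, 3, np.nan, 4, np.nan, np.nan, np.nan, 5, np.nan, np.nan])
holes = nan_run_lengths(a)
print(holes.tolist())  # [1, 3, 2]
```

Because the result is a Series, calling holes.value_counts() should give the counts needed for a histogram of hole lengths.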
Wouter Overmeire answered Oct 22 '22 09:10