Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group consecutive NaN values from a Pandas Series in a set of slices?

I want to merge consecutive NaN values into slices. Is there a simple way of doing this with numpy or pandas?

l = [
    (996, np.nan), (997, np.nan), (998, np.nan),
    (999, -47.3), (1000, -72.5), (1100, -97.7),
    (1200, np.nan), (1201, np.nan), (1205, -97.8),
    (1300, np.nan), (1302, np.nan), (1305, -97.9),
    (1400, np.nan), (1405, -97.10), (1408, np.nan)
]
l = pd.Series(dict(l))

Expected result:

[
    (slice(996, 999, None), array([nan, nan, nan])),
    (999, -47.3),
    (1000, -72.5),
    (1100, -97.7),
    (slice(1200, 1202, None), array([nan, nan])),
    (1205, -97.8),
    (slice(1300, 1301, None), array([nan])),
    (slice(1302, 1303, None), array([nan])),
    (1305, -97.9),
    (slice(1400, 1401, None), array([nan])),
    (1405, -97.1),
    (slice(1408, 1409, None), array([nan]))
]

A numpy array with two dimensions would be OK as well, rather than a list of tuples

Update 2019/05/31: I have just realised that if I just use a dictionary instead of a Pandas Series the algorythm is much more efficient

like image 615
ChesuCR Avatar asked May 23 '19 10:05

ChesuCR


People also ask

How NaN count in pandas series?

To count the NaN values in a column in a Pandas DataFrame, we can use the isna() method with sum.

How do you calculate NaN values in a series?

Counting NaN in the entire DataFrame : To count NaN in the entire dataset, we just need to call the sum() function twice – once for getting the count in each column and again for finding the total sum of all the columns.


1 Answers

What you want is full or corner cases, nan equality, first element of each pair being a slice or a single value, second being a np.array or a single value.

For so complex requirements, I would just rely on a plain Python non vectorized way:

def trans(ser):
    def build(last, cur, val):
        if cur == last + 1:
            if np.isnan(val):
                return (slice(last, cur), np.array([np.nan]))
            else:
                return (last, val)
        else:
            return (slice(last, cur), np.array([val] * (cur - last)))
    last = ser.iloc[0]
    old = last_index = ser.index[0]
    resul = []
    for i in ser.index[1:]:
        val = ser[i]
        if ((val != last) and not(np.isnan(val) and np.isnan(last))) \
           or i != old + 1:
            resul.append(build(last_index, old + 1, last))
            last_index = i
            last = val
        old = i
    resul.append(build(last_index, old+1, last))
    return resul

It gives something close to the expected result:

[(slice(996, 999, None), array([nan, nan, nan])),
 (999, -47.3),
 (1000, -72.5),
 (1100, -97.7),
 (slice(1200, 1202, None), array([nan, nan])),
 (1205, -97.8),
 (slice(1300, 1301, None), array([nan])),
 (slice(1302, 1303, None), array([nan])),
 (1305, -97.9),
 (slice(1400, 1401, None), array([nan])),
 (1405, -97.1),
 (slice(1408, 1409, None), array([nan]))]
like image 69
Serge Ballesta Avatar answered Oct 04 '22 16:10

Serge Ballesta