I'm new to Python and pandas, and I'm trying to use pandas to check whether the data in a DataFrame is continuous. For example:
    thread  sequence    start      end
14       1       114  1647143  1672244
15       1       115  1672244  1689707
16       1       116  1689707  1713090
17       1       118  1735352  1760283
18       1       119  1760283  1788062
19       1       120  1788062  1789885
20       1       121  1789885  1790728
Every row has 4 columns. Normally, sequence increases in steps of 1, so if everything is correct the values would look like 116, 117, 118, ..., like the output of range(). In this example, however, the row with sequence == 117 is missing.
I tried to find the gap, but I don't know how to do it. Checking the sequence row by row would be inefficient. The desired output would either report the missing row or fill it in with NaN.
Any tips or suggestions would be appreciated.
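A faster method using RangeIndex:
import pandas as pd
# The stop value is exclusive, which is fine here: the maximum is present in df, so it cannot be missing.
seq = pd.RangeIndex(df.sequence.min(), df.sequence.max())
seq[~seq.isin(df.sequence)].values
# array([117])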
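If you just want the missing sequence values, you can do something like this:
>>> import numpy as np
>>> # Assumes df is sorted by sequence, so the first and last rows hold the min and max.
>>> seq = pd.DataFrame(np.arange(df.iloc[0].sequence, df.iloc[-1].sequence))
>>> seq[~seq[0].isin(df.sequence)]
     0
3  117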
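The question also asks about filling the missing row with NaN, which the snippets above don't show. Here is a minimal sketch of one way to do that with reindex. It assumes the sequence values are unique (reindex raises an error on duplicate labels), and the names full_range, filled and missing are only illustrative:

```python
import pandas as pd

# Sample data copied from the question; the original index (14..20) is kept.
df = pd.DataFrame(
    {
        "thread": [1, 1, 1, 1, 1, 1, 1],
        "sequence": [114, 115, 116, 118, 119, 120, 121],
        "start": [1647143, 1672244, 1689707, 1735352, 1760283, 1788062, 1789885],
        "end": [1672244, 1689707, 1713090, 1760283, 1788062, 1789885, 1790728],
    },
    index=range(14, 21),
)

# Every sequence value that should exist, including the maximum (hence + 1).
full_range = pd.RangeIndex(df["sequence"].min(), df["sequence"].max() + 1)

# Reindexing on the full range inserts an all-NaN row for each missing value.
filled = (
    df.set_index("sequence")
    .reindex(full_range)
    .rename_axis("sequence")
    .reset_index()
)

# Rows whose other columns are NaN mark the gaps.
missing = filled.loc[filled["thread"].isna(), "sequence"].tolist()
print(missing)  # [117]
```

Note that the other columns become float once NaN is introduced. If the integers need to be preserved, converting those columns to the nullable "Int64" dtype first is one option.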