 

Efficiently extract rows from a pandas DataFrame, ignoring missing index labels

I am looking for a more efficient equivalent of

df.reindex(labels).dropna(subset=[0])

that avoids creating the NaN rows for missing labels in the first place, rather than having to delete them after reindex puts them in.

Equivalently, I am looking for an efficient version of

df.loc[labels]

that silently ignores labels that are not in df.index, i.e. the result may have fewer rows than len(labels).
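To make the desired semantics concrete, here is a minimal sketch with toy data (not from the original post) showing both the KeyError behaviour of df.loc and the NaN-then-drop round trip of reindex:

```python
import numpy as np
import pandas as pd

# Toy frame with a non-contiguous integer index; label 99 is missing.
df = pd.DataFrame({0: [1.0, 2.0, 3.0]}, index=[10, 20, 30])
labels = [10, 99, 30]

# df.loc[labels] raises a KeyError when any label is missing (pandas >= 1.0),
# so it cannot be used directly.
try:
    df.loc[labels]
except KeyError:
    pass

# reindex inserts a NaN row for the missing label 99 ...
padded = df.reindex(labels)
# ... which dropna then has to scan for and remove again.
result = padded.dropna(subset=[0])
print(result.index.tolist())  # [10, 30]
```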

I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.


Update 1

Here is a concrete demonstration of the issue following on from @MaxU's answer:

In [2]: L = 10**7
   ...: M = 10**4
   ...: N = 10**9
   ...: np.random.seed([3, 1415])
   ...: df = pd.DataFrame(np.random.rand(L, 2))
   ...: labels = np.random.randint(N, size=M)
   ...: M-len(set(labels)) 
   ...: 
   ...: 
Out[2]: 0

In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, i.e. df.index does indeed act as an index.

So the question is how to get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers. A partial solution is the original

In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is still ~100 times faster than the suggested solutions, but still loses an order of magnitude to what may be possible.


Update 2

To further demonstrate that it is possible to get sublinear performance even without assumptions about the index, repeat the above with a string index:

In [16]: df.index=df.index.map(str)
    ...: labels = np.array(list(map(str, labels)))
    ...: 
    ...: 

In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So to be clear, I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index; the solution should do the same.

The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
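One candidate that fits these constraints (a sketch worth benchmarking against reindex, assuming a unique index) is Index.get_indexer, which does one hash-table probe per label and marks misses with -1, so the missing rows are never materialised at all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0, 3.0]}, index=["x", "y", "z"])
labels = np.array(["x", "q", "z"])  # "q" is not in the index

# One hash lookup per label; -1 marks labels absent from the (unique) index.
pos = df.index.get_indexer(labels)
# Keep only the hits and fetch them positionally -- no NaN rows to drop.
result = df.iloc[pos[pos != -1]]
print(result.index.tolist())  # ['x', 'z']
```

This is sublinear in df.shape[0] for the same reason reindex is (the hash-table lookup), but skips both the NaN-row construction and the dropna scan.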

Asked Apr 14 '18 by Daniel Mahler


1 Answer

Here is a small comparison for different approaches.

Sample DF (shape: 10,000,000 x 2):

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)

In [88]: df.shape
Out[88]: (10000000, 2)

valid (existing labels):

In [89]: (labels <= 10**7).sum()
Out[89]: 1008

invalid (not existing labels):

In [90]: (labels > 10**7).sum()
Out[90]: 98992

Timings:

In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered Sep 30 '22 by MaxU - stop WAR against UA