 

Efficiently extract rows from a pandas DataFrame, ignoring missing index labels

I am looking for a more efficient equivalent of

df.reindex(labels).dropna(subset=[0])

that avoids creating the NaN rows for missing labels in the first place, rather than having to delete them after reindex puts them in.

Equivalently, I am looking for an efficient version of

df.loc[labels]

that silently ignores labels that are not in df.index, i.e. the result may have fewer rows than len(labels).
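To make the desired semantics concrete, here is a minimal sketch with toy data (not from the original post) showing both the KeyError behaviour of df.loc and the NaN-then-drop round trip of reindex:

```python
import numpy as np
import pandas as pd

# Toy frame with a non-contiguous integer index; label 99 is missing.
df = pd.DataFrame({0: [1.0, 2.0, 3.0]}, index=[10, 20, 30])
labels = [10, 99, 30]

# df.loc[labels] raises a KeyError when any label is missing (pandas >= 1.0),
# so it cannot be used directly.
try:
    df.loc[labels]
except KeyError:
    pass

# reindex inserts a NaN row for the missing label 99 ...
padded = df.reindex(labels)
# ... which dropna then has to scan for and remove again.
result = padded.dropna(subset=[0])
print(result.index.tolist())  # [10, 30]
```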

I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.


Update 1

Here is a concrete demonstration of the issue following on from @MaxU's answer:

In [2]: L = 10**7
   ...: M = 10**4
   ...: N = 10**9
   ...: np.random.seed([3, 1415])
   ...: df = pd.DataFrame(np.random.rand(L, 2))
   ...: labels = np.random.randint(N, size=M)
   ...: M-len(set(labels)) 
   ...: 
   ...: 
Out[2]: 0

In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, i.e. df.index does indeed act as an index.

So the question is how to get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers. A partial solution is the original

In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That is still ~100 times faster than the suggested solutions, but still loses an order of magnitude to what may be possible.


Update 2

To further demonstrate that it is possible to get sublinear performance even without assumptions about the index, repeat the above with a string index:

In [16]: df.index=df.index.map(str)
    ...: labels = np.array(list(map(str, labels)))
    ...: 
    ...: 

In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So to be clear, I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index; the solution should do the same.

The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
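One candidate that fits these constraints (a sketch worth benchmarking against reindex, assuming a unique index) is Index.get_indexer, which does one hash-table probe per label and marks misses with -1, so the missing rows are never materialised at all:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1.0, 2.0, 3.0]}, index=["x", "y", "z"])
labels = np.array(["x", "q", "z"])  # "q" is not in the index

# One hash lookup per label; -1 marks labels absent from the (unique) index.
pos = df.index.get_indexer(labels)
# Keep only the hits and fetch them positionally -- no NaN rows to drop.
result = df.iloc[pos[pos != -1]]
print(result.index.tolist())  # ['x', 'z']
```

This is sublinear in df.shape[0] for the same reason reindex is (the hash-table lookup), but skips both the NaN-row construction and the dropna scan.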

Asked Apr 14 '18 by Daniel Mahler


1 Answer

Here is a small comparison for different approaches.

Sample DF (shape: 10,000,000 x 2):

np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)

In [88]: df.shape
Out[88]: (10000000, 2)

valid (existing labels):

In [89]: (labels <= 10**7).sum()
Out[89]: 1008

invalid (not existing labels):

In [90]: (labels > 10**7).sum()
Out[90]: 98992

Timings:

In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Answered Sep 30 '22 by MaxU - stop WAR against UA