Pandas DataFrame search is linear time or constant time?

Tags:

I have a dataframe object df of over 15000 rows like:

anime_id          name              genre    rating
1234      Kimi no nawa    Romance, Comedy     9.31
5678       Stiens;Gate             Sci-fi     8.92

And I am trying to find the row with a particular anime_id.

a_id = "5678"
temp = (df.query("anime_id == "+a_id).genre)

I just wanted to know if this search was done in constant time (like dictionaries) or linear time(like lists).

219

asked Jul 21 '17 14:07

Sayan Sil

1 Answers

This is a very interesting question!

I think it depends on the following aspects:

accessing single row by index (index is sorted and unique) should have runtime O(m) where m << n_rows

accessing single row by index (index is NOT unique and is NOT sorted) should have runtime O(n_rows)

accessing single row by index (index is NOT unique and is sorted) should have runtime O(m) where m < n_rows)

accessing row(s) (independently of an index) by boolean indexing should have runtime O(n_rows)

Demo:

index is sorted and unique:

In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))

In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 µs per loop

In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 µs per loop

In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop

In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop

index is NOT sorted and is NOT unique:

In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))

In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop

In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop

In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop

index is NOT unique and is sorted:

In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()

In [65]: df.index.is_monotonic_increasing
Out[65]: True

In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 µs per loop

In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop

In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop

In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop

answered Oct 19 '22 11:10

MaxU - stop WAR against UA

Related questions
                            
                                Margin (vertical space) of the first block after a text node?
                            
                                -allowProvisioningUpdates doesn't work
                            
                                iOS simulator: double click home button does not work sometimes
                            
                                RecyclerView fast scroll thumb height too small for large data set
                            
                                What is the difference between destructured assignment and "normal" assignment? [duplicate]
                            
                                Detect whether touch was Apple Pencil or finger in webview (react native)
                            
                                Python subprocess with real-time input and multiple consoles
                            
                                Paging library returns empty list initially
                            
                                Karma: use Windows' Chrome from WSL
                            
                                How to avoid VsCode Prettier to break chain functions in new lines.?
                            
                                disable printWidth on prettier
                            
                                Is it possible to use CSS variables in Twitter-bootstrap 4 $theme-colors array?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With