I have a dataframe object df
of over 15000 rows like:
anime_id name genre rating
1234 Kimi no nawa Romance, Comedy 9.31
5678 Stiens;Gate Sci-fi 8.92
And I am trying to find the row with a particular anime_id.
a_id = "5678"
temp = (df.query("anime_id == "+a_id).genre)
I just wanted to know if this search was done in constant time (like dictionaries) or linear time(like lists).
loc . I have a DataFrame with 4.8 million rows, and selecting a single row using . iloc[[ id ]] (with a single-element list) takes 489 ms, almost half a second, 1,800x times slower than the identical .
When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.
pandas contains extensive capabilities and features for working with time series data for all domains. Using the NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other Python libraries like scikits.
Most straightforward row iteration Despite its ease of use and intuitive nature, iterrows() is one of the slowest ways to iterate over rows. This article will also look at how you can substitute iterrows() for itertuples() or apply() to speed up iteration.
This is a very interesting question!
I think it depends on the following aspects:
accessing single row by index (index is sorted and unique) should have runtime O(m)
where m << n_rows
accessing single row by index (index is NOT unique and is NOT sorted) should have runtime O(n_rows)
accessing single row by index (index is NOT unique and is sorted) should have runtime O(m)
where m < n_rows
)
accessing row(s) (independently of an index) by boolean indexing should have runtime O(n_rows)
Demo:
index is sorted and unique:
In [49]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'))
In [50]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 27.65 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 331 µs per loop
In [51]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 275 µs per loop
In [52]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.84 ms per loop
In [53]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.96 ms per loop
index is NOT sorted and is NOT unique:
In [54]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5))
In [55]: %timeit df.loc[random.randint(0, 10**4)]
100 loops, best of 3: 12.3 ms per loop
In [56]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop
In [57]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.78 ms per loop
In [58]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.93 ms per loop
index is NOT unique and is sorted:
In [64]: df = pd.DataFrame(np.random.rand(10**5,6), columns=list('abcdef'), index=np.random.randint(0, 10000, 10**5)).sort_index()
In [65]: df.index.is_monotonic_increasing
Out[65]: True
In [66]: %timeit df.loc[random.randint(0, 10**4)]
The slowest run took 9.70 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 478 µs per loop
In [67]: %timeit df.iloc[random.randint(0, 10**4)]
1000 loops, best of 3: 262 µs per loop
In [68]: %timeit df.query("a > 0.9")
100 loops, best of 3: 7.81 ms per loop
In [69]: %timeit df.loc[df.a > 0.9]
100 loops, best of 3: 2.95 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With