I've been exploring how to optimize my code and ran across pandas
.at
method. Per the documentation
Fast label-based scalar accessor
Similarly to loc, at provides label based scalar lookups. You can also set using these indexers.
So I ran some samples:
import pandas as pd
import numpy as np
from string import letters, lowercase, uppercase
lt = list(letters)
lc = list(lowercase)
uc = list(uppercase)
def gdf(rows, cols, seed=None):
"""rows and cols are what you'd pass
to pd.MultiIndex.from_product()"""
gmi = pd.MultiIndex.from_product
df = pd.DataFrame(index=gmi(rows), columns=gmi(cols))
np.random.seed(seed)
df.iloc[:, :] = np.random.rand(*df.shape)
return df
seed = [3, 1415]
df = gdf([lc, uc], [lc, uc], seed)
print df.head().T.head().T
df
looks like:
a
A B C D E
a A 0.444939 0.407554 0.460148 0.465239 0.462691
B 0.032746 0.485650 0.503892 0.351520 0.061569
C 0.777350 0.047677 0.250667 0.602878 0.570528
D 0.927783 0.653868 0.381103 0.959544 0.033253
E 0.191985 0.304597 0.195106 0.370921 0.631576
Lets use .at
and .loc
and ensure I get the same thing
print "using .loc", df.loc[('a', 'A'), ('c', 'C')]
print "using .at ", df.at[('a', 'A'), ('c', 'C')]
using .loc 0.37374090276
using .at 0.37374090276
Test speed using .loc
%%timeit
df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 180 µs per loop
Test speed using .at
%%timeit
df.at[('a', 'A'), ('c', 'C')]
The slowest run took 6.11 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8 µs per loop
This looks to be a huge speed increase. Even at the caching stage 6.11 * 8
is a lot faster than 180
What are the limitations of .at
? I'm motivated to use it. The documentation says it's similar to .loc
but it doesn't behave similarly. Example:
# small df
sdf = gdf([lc[:2]], [uc[:2]], seed)
print sdf.loc[:, :]
A B
a 0.444939 0.407554
b 0.460148 0.465239
where as print sdf.at[:, :]
results in TypeError: unhashable type
So obviously not the same even if the intent is to be similar.
That said, who can provide guidance on what can and cannot be done with the .at
method?
When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.
Pandas at[] is used to return data in a dataframe at the passed location. The passed location is in the format [position, Column Name]. This method works in a similar way to Pandas loc[ ] but at[ ] is used to return an only single value and hence works faster than it.
at selects a single scalar value in the DataFrame by label only. . iat selects a single scalar value in the DataFrame by integer location only.
loc. Access a group of rows and columns by label(s) or a boolean array. . loc[] is primarily label based, but may also be used with a boolean array.
Update: df.get_value
is deprecated as of version 0.21.0. Using df.at
or df.iat
is the recommended method going forward.
df.at
can only access a single value at a time.
df.loc
can select multiple rows and/or columns.
Note that there is also df.get_value
, which may be even quicker at accessing single values:
In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop
In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop
In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop
Under the hood, df.at[...]
calls df.get_value
, but it also does some type checking on the keys.
As you asked about the limitations of .at
, here is one thing I recently ran into (using pandas 0.22). Let's use the example from the documentation:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]], index=[4, 5, 6], columns=['A', 'B', 'C'])
df2 = df.copy()
A B C
4 0 2 3
5 0 4 1
6 10 20 30
If I now do
df.at[4, 'B'] = 100
the result looks as expected
A B C
4 0 100 3
5 0 4 1
6 10 20 30
However, when I try to do
df.at[4, 'C'] = 10.05
it seems that .at
tries to conserve the datatype (here: int
):
A B C
4 0 100 10
5 0 4 1
6 10 20 30
That seems to be a difference to .loc
:
df2.loc[4, 'C'] = 10.05
yields the desired
A B C
4 0 2 10.05
5 0 4 1.00
6 10 20 30.00
The risky thing in the example above is that it happens silently (the conversion from float
to int
). When one tries the same with strings it will throw an error:
df.at[5, 'A'] = 'a_string'
ValueError: invalid literal for int() with base 10: 'a_string'
It will work, however, if one uses a string on which int()
actually works as noted by @n1k31t4 in the comments, e.g.
df.at[5, 'A'] = '123'
A B C
4 0 2 3
5 123 4 1
6 10 20 30
Adding to the above, Pandas documentation for the at
function states:
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series.
For setting data loc
and at
are similar, for example:
df = pd.DataFrame({'A': [1,2,3], 'B': [11,22,33]}, index=[0,0,1])
Both loc
and at
will produce the same result
df.at[0, 'A'] = [101,102]
df.loc[0, 'A'] = [101,102]
A B
0 101 11
0 102 22
1 3 33
df.at[0, 'A'] = 103
df.loc[0, 'A'] = 103
A B
0 103 11
0 103 22
1 3 33
Also, for accessing a single value, both are the same
df.loc[1, 'A'] # returns a single value (<class 'numpy.int64'>)
df.at[1, 'A'] # returns a single value (<class 'numpy.int64'>)
3
However, when matching multiple values, loc
will return a group of rows/cols from the DataFrame while at
will return an array of values
df.loc[0, 'A'] # returns a Series (<class 'pandas.core.series.Series'>)
0 103
0 103
Name: A, dtype: int64
df.at[0, 'A'] # returns array of values (<class 'numpy.ndarray'>)
array([103, 103])
And more so, loc
can be used to match a group of row/cols and can be given only an index, while at
must receive the column
df.loc[0] # returns a DataFrame view (<class 'pandas.core.frame.DataFrame'>)
A B
0 103 11
0 103 22
# df.at[0] # ERROR: must receive column
.at
is an optimized data access method compared to .loc
.
.loc
of a data frame selects all the elements located by indexed_rows and labeled_columns as given in its argument. Instead, .at
selects particular element of a data frame positioned at the given indexed_row and labeled_column.
Also, .at
takes one row and one column as input argument, whereas .loc
may take multiple rows and columns. Output using .at
is a single element and using .loc
maybe a Series or a DataFrame.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With