I'm often answering questions in which I advocate converting dataframe values to an underlying numpy array for quicker calculations. However, there are some caveats to doing this, and some approaches are better than others.
I'll be providing my own answer in an effort to give back to the community. I hope you find it useful.
Problem
Consider the dataframe df
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))
print(df)
A B C D
0 1 x 9 4
1 2 y 8 5
2 3 z 7 6
with dtypes
print(df.dtypes)
A int64
B object
C int64
D int64
dtype: object
I want to create a numpy array a that consists of the values from columns A and C. Assume that there could be many columns and that I'm targeting the two specific columns A and C.
What I've tried
I could do:
df[['A', 'C']].values
array([[1, 9],
[2, 8],
[3, 7]])
This is accurate!
However, I can do it more quickly with numpy:
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
array([[1, 9],
[2, 8],
[3, 7]], dtype=object)
This is quicker, but inaccurate. Notice the dtype=object. I need integers!
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
array([[1, 9],
[2, 8],
[3, 7]])
This is now correct, but I may not have known that I had all integers.
Timing
# Clear and accurate, but slower
%%timeit
df[['A', 'C']].values
1000 loops, best of 3: 347 µs per loop
# Not accurate, but close and fast
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
10000 loops, best of 3: 59.2 µs per loop
# Accurate for this test case and fast, needs to be more generalized.
%%timeit
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
10000 loops, best of 3: 59.3 µs per loop
pandas does not store a single array for the entire dataframe in the values attribute. When you call the values attribute on a dataframe, it builds the array from the underlying objects that it does store, namely the pd.Series objects. It's useful to think of a dataframe as a pd.Series of pd.Series, where each column is one such pd.Series that the dataframe contains. Each column can have a dtype that is different from the rest; that is part of why dataframes are so useful. However, a numpy array must have a single dtype. When we call the values attribute on a dataframe, it goes to each column, pulls the data from each of the respective values attributes, and cobbles them together. If the columns' respective dtypes are inconsistent, then the dtype of the resulting array will be forced to be object.
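To illustrate (a minimal sketch using the df defined above): slicing only the homogeneous integer columns keeps a numeric dtype, while including the object column B forces the whole array to object.

df[['A', 'C']].values.dtype  # dtype('int64'), A and C are both int64
df[['A', 'B']].values.dtype  # dtype('O'), B is object so everything upcasts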
Option 1
Slow but accurate
a = df[['A', 'C']].values
The reason this is slow is that you are asking pandas to build a new dataframe, df[['A', 'C']], and then to build the array a by hitting each of the new dataframe's columns' values attributes.
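As a quick sanity check (my own sketch, not part of the timings below), the sub-dataframe route hands back a freshly allocated array rather than a view of the original data, which is where the extra cost goes:

import numpy as np

sub = df[['A', 'C']].values
# False in my tests: the data was copied into a new array, not viewed
np.shares_memory(sub, df['A'].values)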
Option 2
Find column positions then slice values
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
This is better because we build only the values array, without constructing a new dataframe first. I'm trusting that we are getting an array with consistent dtypes. If upcasting needs to happen, I'm not dealing with it well here.
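If I wanted to handle upcasting more defensively, one option (my own extension, not part of Option 2 as written) is to let numpy pick a common dtype across the targeted columns with np.result_type:

import numpy as np

c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
# result_type finds the smallest dtype that can hold every input,
# e.g. int64 and float64 promote to float64
a = df.values[:, p].astype(np.result_type(*df.dtypes[c]))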
Option 3
My preferred approach
Only access the values of the columns I care about
a = np.column_stack([df[col].values for col in ['A', 'C']])
This leverages the pandas dataframe as a container of pd.Series, in which I access the values attribute of only the columns I care about. I then build a new array from those arrays. If casting needs to be addressed, numpy will handle it.
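For example (a small sketch with a hypothetical float column E added just for illustration), np.column_stack applies numpy's usual promotion rules:

df2 = df.assign(E=[1.5, 2.5, 3.5])  # hypothetical float column
np.column_stack([df2[col].values for col in ['A', 'E']]).dtype
# dtype('float64'), int64 and float64 promote to float64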
All approaches yield the same result
array([[1, 9],
[2, 8],
[3, 7]])
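A quick way to confirm the equivalence (my own check, not part of the timings below):

c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a1 = df[['A', 'C']].values
a2 = df.values[:, p].astype(df.dtypes[c[0]])
a3 = np.column_stack([df[col].values for col in c])
assert np.array_equal(a1, a2) and np.array_equal(a1, a3)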
Timing
small data
%%timeit
a = df[['A', 'C']].values
1000 loops, best of 3: 338 µs per loop
%%timeit
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
10000 loops, best of 3: 166 µs per loop
%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 7.36 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.97 µs per loop
big data
from string import ascii_uppercase

df = pd.concat(
[df.join(pd.DataFrame(
np.random.randint(10, size=(3, 22)),
columns=list(ascii_uppercase[4:])
))] * 10000, ignore_index=True
)
%%timeit
a = df[['A', 'C']].values
The slowest run took 23.28 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 371 µs per loop
%%timeit
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
100 loops, best of 3: 9.62 ms per loop
%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 55.6 µs per loop