 

How do I efficiently get a numpy array for a subset of columns from my dataframe?

Motivation

I often answer questions in which I advocate converting dataframe values to the underlying numpy array for quicker calculations. However, there are some caveats to doing this, and some ways of doing it are better than others.

I'll be providing my own answer in an effort to give back to the community. I hope you guys find it useful.

Problem
Consider the dataframe df

import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9, 8, 7], D=[4, 5, 6]))
print(df)

   A  B  C  D
0  1  x  9  4
1  2  y  8  5
2  3  z  7  6

with dtypes

print(df.dtypes)

A     int64
B    object
C     int64
D     int64
dtype: object

I want to create a numpy array a that consists of the values from columns A and C. Assume that there could be many columns, and that I'm targeting the two specific columns A and C.

What I've tried

I could do:

df[['A', 'C']].values

array([[1, 9],
       [2, 8],
       [3, 7]])

This is accurate!

However, I can do it quicker with numpy:

p = [df.columns.get_loc(i) for i in ['A', 'C']]  # integer positions of the target columns
df.values[:, p]

array([[1, 9],
       [2, 8],
       [3, 7]], dtype=object)

This is quicker, but inaccurate. Notice the dtype=object. I need integers!

p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)  # cast back from object to int

array([[1, 9],
       [2, 8],
       [3, 7]])

This is now correct, but I may not have known in advance that all of the values were integers.
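One way to generalize this is to check the target columns' dtypes first and only take the fast path when they agree. This is a sketch of mine rather than part of the question, and fast_values is a hypothetical helper name:

def fast_values(df, cols):
    # Return df[cols] as a numpy array, using the fast path only when dtypes agree
    dtypes = df.dtypes[cols]
    if dtypes.nunique() == 1:
        p = [df.columns.get_loc(c) for c in cols]
        return df.values[:, p].astype(dtypes.iloc[0])
    return df[cols].values  # mixed dtypes: fall back to the safe, slower route

fast_values(df, ['A', 'C'])

array([[1, 9],
       [2, 8],
       [3, 7]])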

Timing

%%timeit
# Clear and accurate, but slower
df[['A', 'C']].values
1000 loops, best of 3: 347 µs per loop

%%timeit
# Fast, but the dtype comes back as object rather than int
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p]
10000 loops, best of 3: 59.2 µs per loop

%%timeit
# Accurate and fast for this test case, but it needs to be generalized
p = [df.columns.get_loc(i) for i in ['A', 'C']]
df.values[:, p].astype(int)
10000 loops, best of 3: 59.3 µs per loop
asked by piRSquared

1 Answer

pandas does not store a single array for the entire dataframe in the values attribute. When you call the values attribute on a dataframe, it builds the array from the underlying objects that it does store, namely the pd.Series objects. It's useful to think of a dataframe as a pd.Series of pd.Series, where each column is one such pd.Series that the dataframe contains. Each column can have a dtype that is different from the rest; that is part of why dataframes are so useful.

However, a numpy array must have a single dtype. When we call the values attribute on a dataframe, it goes to each column, pulls the data from each of the respective values attributes, and cobbles them together. If the columns' respective dtypes are inconsistent, then the dtype of the resulting array will be forced to object.
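The coercion is easy to see directly, using the df from the question:

print(df['A'].values.dtype)         # int64  -- a single column keeps its own dtype
print(df.values.dtype)              # object -- the object column B forces everything to object
print(df[['A', 'C']].values.dtype)  # int64  -- a homogeneous subset stays numeric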

Option 1
Slow but accurate

a = df[['A', 'C']].values

This is slow because you are asking pandas to build a new dataframe, df[['A', 'C']], and then build the array a by hitting each of the new dataframe's columns' values attributes.
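If you want to see where the time goes, you can time the intermediate step on its own (a rough check of mine; absolute numbers will vary by machine and pandas version):

%timeit df[['A', 'C']]         # cost of building the intermediate dataframe
%timeit df[['A', 'C']].values  # the intermediate dataframe plus the array construction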

Option 2
Find column positions then slice values

c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]       # integer positions of the target columns
a = df.values[:, p].astype(df.dtypes[c[0]])  # cast using the first target column's dtype

This is better because we only build the values array, without building a new dataframe. I'm trusting that the target columns share a consistent dtype; if upcasting needs to happen, this approach doesn't handle it well, as the example below shows.
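For example, if the second target column held floats, casting everything to the first column's dtype would silently truncate. Here df2 is a hypothetical variant of the question's df:

df2 = pd.DataFrame(dict(A=[1, 2, 3], B=list('xyz'), C=[9.5, 8.5, 7.5]))
c = ['A', 'C']
p = [df2.columns.get_loc(i) for i in c]
df2.values[:, p].astype(df2.dtypes[c[0]])

array([[1, 9],
       [2, 8],
       [3, 7]])

The .5s are silently dropped.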

Option 3
My preferred approach
Only access the values of the columns I care about

a = np.column_stack([df[col].values for col in ['A', 'C']])

This leverages the pandas dataframe as a container of pd.Series in which I access the values attribute of only the columns I care about. I then build a new array from those arrays. If casting needs to be addressed, numpy will handle it.
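For instance, stacking an int column with a float column upcasts to the common dtype instead of truncating. Again, df2 is a hypothetical mixed-dtype variant of the question's df:

df2 = pd.DataFrame(dict(A=[1, 2, 3], C=[9.5, 8.5, 7.5]))
np.column_stack([df2[col].values for col in ['A', 'C']])

array([[ 1. ,  9.5],
       [ 2. ,  8.5],
       [ 3. ,  7.5]])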


All three approaches yield the same result:

array([[1, 9],
       [2, 8],
       [3, 7]])
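A quick sanity check confirms the three options agree:

a1 = df[['A', 'C']].values
p = [df.columns.get_loc(i) for i in ['A', 'C']]
a2 = df.values[:, p].astype(df.dtypes['A'])
a3 = np.column_stack([df[col].values for col in ['A', 'C']])
np.array_equal(a1, a2) and np.array_equal(a2, a3)

True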

Timing
small data

%%timeit 
a = df[['A', 'C']].values
1000 loops, best of 3: 338 µs per loop

%%timeit 
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
10000 loops, best of 3: 166 µs per loop

%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 7.36 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.97 µs per loop

big data

from string import ascii_uppercase

df = pd.concat(
    [df.join(pd.DataFrame(
                np.random.randint(10, size=(3, 22)),
                columns=list(ascii_uppercase[4:])
            ))] * 10000, ignore_index=True
)


%%timeit 
a = df[['A', 'C']].values
The slowest run took 23.28 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 371 µs per loop

%%timeit 
c = ['A', 'C']
p = [df.columns.get_loc(i) for i in c]
a = df.values[:, p].astype(df.dtypes[c[0]])
100 loops, best of 3: 9.62 ms per loop

%timeit np.column_stack([df[col].values for col in ['A', 'C']])
The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 55.6 µs per loop
answered by piRSquared