Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [10, 20, 30, 40],
                   "C": [100, 200, 300, 400]})
columns_to_select = ["B", "A", "A", "C"]
Goal:
[10, 2, 3, 400]
One method that works is to use an apply statement.
df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)
Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.
Pandas approach:
In [22]: df['new'] = df.lookup(df.index, columns_to_select)
In [23]: df
Out[23]:
A B C new
0 1 10 100 10
1 2 20 200 2
2 3 30 300 3
3 4 40 400 400
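Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On modern pandas, a minimal sketch of an equivalent (with df and columns_to_select as defined above), using Index.get_indexer plus NumPy advanced indexing:
import numpy as np
# Map the requested column labels to integer positions...
col_idx = df.columns.get_indexer(columns_to_select)
# ...then pick one element per row with advanced indexing
df['new'] = df.to_numpy()[np.arange(len(df)), col_idx]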
NumPy way
Here's a vectorized NumPy way using advanced indexing -
# Extract array data
In [10]: a = df.values
# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)
# Use NumPy's advanced indexing to extract the relevant element per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10, 2, 3, 400])
If the column names of df are not sorted, we need to use the sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be -
# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    # Argsort the column names, search in sorted order, then map
    # the hits back to the original (unsorted) column positions
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
So, col_idx would be obtained like so -
col_idx = column_index(df, columns_to_select)
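For example, with an unsorted set of column names (a small sketch, not from the original post):
df2 = pd.DataFrame({"C": [100, 200], "A": [1, 2], "B": [10, 20]},
                   columns=["C", "A", "B"])
column_index(df2, ["B", "A"])  # -> array([2, 1])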
Further optimization
Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that, and using the special-case scenario of the column names being single letters, we could quickly convert those to numerals and then feed those to searchsorted for much faster processing.
Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -
def column_index_singlechar_sorted(df, query_cols):
    # Convert the single-letter column names to their byte codes
    # (np.frombuffer replaces the deprecated np.fromstring here)
    c0 = np.frombuffer(''.join(df.columns).encode(), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode(), dtype=np.uint8)
    return np.searchsorted(c0, c1)
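A quick sanity check on the example df from the question (columns A, B, C are single letters and sorted):
column_index_singlechar_sorted(df, ["B", "A", "A", "C"])  # -> array([1, 0, 0, 2])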
This gives us a modified version of the solution, like so -
a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
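If df carries a non-default index and the result should align with it, pass the index explicitly (an assumption about the desired output, not part of the original answer):
out = pd.Series(a[np.arange(len(col_idx)), col_idx], index=df.index)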
Timings -
In [149]: # Setup df with 26 uppercase column letters and many rows
...: import string
...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
...: s = list(string.ascii_uppercase[:df.shape[1]])
...: df.columns = s
...: idx = np.random.randint(0,df.shape[1],len(df))
...: columns_to_select = np.take(s, idx).tolist()
# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop
# With proposed one from this soln
In [151]: %%timeit
...: a = df.values
...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop
Given that df.lookup solves for a generic case, that's probably a better choice, but the other possible optimizations as shown in this post could be handy as well!