Given a DataFrame with multiple columns, how do we select values from specific columns by row to create a new Series?
df = pd.DataFrame({"A": [1, 2, 3, 4],
                   "B": [10, 20, 30, 40],
                   "C": [100, 200, 300, 400]})
columns_to_select = ["B", "A", "A", "C"]
Goal:
[10, 2, 3, 400]
One method that works is to use an apply statement.
df["cols"] = columns_to_select
df.apply(lambda x: x[x.cols], axis=1)
Unfortunately, this is not a vectorized operation and takes a long time on a large dataset. Any ideas would be appreciated.
Pandas approach:
In [22]: df['new'] = df.lookup(df.index, columns_to_select)
In [23]: df
Out[23]:
A B C new
0 1 10 100 10
1 2 20 200 2
2 3 30 300 3
3 4 40 400 400
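Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On modern pandas, a minimal sketch of an equivalent (with df and columns_to_select as defined above), using Index.get_indexer plus NumPy advanced indexing:
import numpy as np
# Map the requested column labels to integer positions...
col_idx = df.columns.get_indexer(columns_to_select)
# ...then pick one element per row with advanced indexing
df['new'] = df.to_numpy()[np.arange(len(df)), col_idx]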
NumPy way
Here's a vectorized NumPy way using advanced indexing -
# Extract array data
In [10]: a = df.values
# Get integer based column IDs
In [11]: col_idx = np.searchsorted(df.columns, columns_to_select)
# Use NumPy's advanced indexing to extract the relevant element per row
In [12]: a[np.arange(len(col_idx)), col_idx]
Out[12]: array([ 10, 2, 3, 400])
If the column names of df are not sorted, we need to use the sorter argument with np.searchsorted. The code to extract col_idx for such a generic df would be -
# https://stackoverflow.com/a/38489403/ @Divakar
def column_index(df, query_cols):
    cols = df.columns.values
    # Argsort the column names, search in sorted order, then map
    # the hits back to the original (unsorted) column positions
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
So, col_idx would be obtained like so -
col_idx = column_index(df, columns_to_select)
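For example, with an unsorted set of column names (a small sketch, not from the original post):
df2 = pd.DataFrame({"C": [100, 200], "A": [1, 2], "B": [10, 20]},
                   columns=["C", "A", "B"])
column_index(df2, ["B", "A"])  # -> array([2, 1])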
Further optimization
Profiling it revealed that the bottleneck was processing strings with np.searchsorted, the usual NumPy weakness of not being so great with strings. So, to overcome that, and using the special-case scenario of the column names being single letters, we could quickly convert those to numerals and then feed those to searchsorted for much faster processing.
Thus, an optimized version of getting the integer based column IDs, for the case where the column names are single letters and sorted, would be -
def column_index_singlechar_sorted(df, query_cols):
    # Convert the single-letter column names to their byte codes
    # (np.frombuffer replaces the deprecated np.fromstring here)
    c0 = np.frombuffer(''.join(df.columns).encode(), dtype=np.uint8)
    c1 = np.frombuffer(''.join(query_cols).encode(), dtype=np.uint8)
    return np.searchsorted(c0, c1)
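A quick sanity check on the example df from the question (columns A, B, C are single letters and sorted):
column_index_singlechar_sorted(df, ["B", "A", "A", "C"])  # -> array([1, 0, 0, 2])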
This gives us a modified version of the solution, like so -
a = df.values
col_idx = column_index_singlechar_sorted(df, columns_to_select)
out = pd.Series(a[np.arange(len(col_idx)), col_idx])
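If df carries a non-default index and the result should align with it, pass the index explicitly (an assumption about the desired output, not part of the original answer):
out = pd.Series(a[np.arange(len(col_idx)), col_idx], index=df.index)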
Timings -
In [149]: # Setup df with 26 uppercase column letters and many rows
...: import string
...: df = pd.DataFrame(np.random.randint(0,9,(1000000,26)))
...: s = list(string.ascii_uppercase[:df.shape[1]])
...: df.columns = s
...: idx = np.random.randint(0,df.shape[1],len(df))
...: columns_to_select = np.take(s, idx).tolist()
# With df.lookup from @MaxU's soln
In [150]: %timeit pd.Series(df.lookup(df.index, columns_to_select))
10 loops, best of 3: 76.7 ms per loop
# With proposed one from this soln
In [151]: %%timeit
...: a = df.values
...: col_idx = column_index_singlechar_sorted(df, columns_to_select)
...: out = pd.Series(a[np.arange(len(col_idx)), col_idx])
10 loops, best of 3: 59 ms per loop
Given that df.lookup solves for a generic case, that's probably a better choice, but the other possible optimizations as shown in this post could be handy as well!