Fastest way to find which of two lists of columns of each row is true in a pandas dataframe

Tags: performance, python, pandas, vectorization, numpy

I'm looking for the fastest way to do the following:

We have a pd.DataFrame:

df = pd.DataFrame({
    'High': [1.3,1.2,1.1],
    'Low': [1.3,1.2,1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})

That looks like:

In [4]: df
Out[4]:
   High  High1  High2  High3  Low  Low1  Low2  Low3
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1
1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1

What I want to know is which one of the High1, High2, High3 float values is the first that is larger than or equal to the High value. If there is none, it should be np.nan.

And the same for the Low1, Low2, Low3 values, but in this case which one of them is the first that is lower than or equal to the Low value. If there is none, it should be np.nan.

At the end I need to know which one, Low or High has come first.

One way to solve this, in a weird and not too performant way, is:

df['LowIs'] = np.nan
df['HighIs'] = np.nan

for i in range(1,4):
    df['LowIs'] = np.where((np.isnan(df['LowIs'])) & (
        df['Low'] >= df['Low'+str(i)]), i, df['LowIs'])
    df['HighIs'] = np.where((np.isnan(df['HighIs'])) & (
        df['High'] <= df['High'+str(i)]), i, df['HighIs'])

df['IsFirst'] = np.where(
    df.LowIs < df.HighIs,
    'Low',
    np.where(df.LowIs > df.HighIs, 'High', 'None')
)

Which gives me:

In [8]: df
Out[8]:
   High  High1  High2  High3  Low  Low1  Low2  Low3  LowIs  HighIs IsFirst
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1    1.0     3.0     Low
1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1    2.0     2.0    None
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1    3.0     1.0    High

As I have to do this over and over again in many iterations where High/Low will be different, performance when doing this is key.

So I wouldn't mind if High1, High2, High3 and Low1, Low2, Low3 were in a separate DataFrame that is transposed, or in a dict or whatever. The process of preparing the data can be slow and awkward, as long as the result gives the best possible performance.

One solution I worked on, but couldn't get to work in a vectorized way and which also seems quite slow, is:

df.loc[(df.index == 0), 'HighIs'] = np.where(
    df.loc[(df.index == 0), ['High1', 'High2', 'High3']] >= 1.3
)[1][0] + 1

So checking for which one of the columns it is true in that first row and then looking at the index number of np.where().

Looking forward to any suggestions and hope to learn something new! :)

asked Nov 22 '16 16:11

Marco

2 Answers

If I understood the question right, this is a semi-vectorized version:

df = pd.DataFrame({
    'High': [1.3,1.7,1.1],
    'Low': [1.3,1.2,1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})

highs = ['High{:d}'.format(x) for x in range(1, 4)]  # High1..High3; there is no High0 column

for h in highs[::-1]:
    mask = df['High'] <= df[h]
    df.loc[mask, 'FirstHigh'] = h

Produces:

   High  High1  High2  High3  Low  Low1  Low2  Low3 FirstHigh
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1     High3
1   1.7    1.1    1.2    1.3  1.2   1.3   1.2   1.1       NaN
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1     High1

Explanation: The key here is that we iterate over the columns in reverse. We start at High3, check whether it is greater than or equal to High, and set FirstHigh accordingly. Then we move on to High2. If it also qualifies, we overwrite the previous result; if not, the result stays as is. Because we iterate in reverse order, the first column to be greater than or equal to High ends up as the final result.
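The same reverse-overwrite idea can be applied to the Low columns as well. Below is a minimal sketch, assuming the column naming from the question (High1..High3, Low1..Low3); the helper name first_hit and its parameters are made up for illustration and are not part of pandas. It stores the 1-based position instead of the column name so the two sides can be compared afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1],
    'Low': [1.3, 1.2, 1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})


def first_hit(frame, base, n_cols, compare):
    """Hypothetical helper: 1-based position of the first column
    base1..baseN for which compare(column, reference) holds, NaN if none."""
    result = pd.Series(np.nan, index=frame.index)
    # Walk the columns in reverse so that earlier columns overwrite later ones.
    for i in range(n_cols, 0, -1):
        mask = compare(frame[base + str(i)], frame[base])
        result.loc[mask] = i
    return result


high_is = first_hit(df, 'High', 3, lambda col, ref: col >= ref)
low_is = first_hit(df, 'Low', 3, lambda col, ref: col <= ref)

print(high_is.tolist())  # [3.0, nan, 1.0]
print(low_is.tolist())   # [1.0, 2.0, 3.0]
```

This still loops over the columns in Python, so it is only as vectorized as the original answer; the number of iterations is the number of columns, not the number of rows.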

answered Sep 29 '22 06:09

Aske Doerge

Test your High-n columns against the High column:

a = df[['High1', 'High2', 'High3']].ge(df.High, axis=0)  # select by name; positions depend on column order

a
Out[67]: 
   High1  High2  High3
0  False  False   True
1  False  False  False
2   True   True   True

Now replace False with np.nan and ask for the column index of the min or max (it doesn't matter which, as every value is either True or np.nan):

a.replace(False, np.nan).idxmax(1)

0    High3
1      NaN
2    High1

Same principle for the Low columns with le as comparison operator.
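If the data may be prepared as plain NumPy arrays, as the question allows, the same "first True per row" lookup can be sketched without pandas in the hot path. This is one possible approach under that assumption, not a benchmarked one; the function name first_true_position is made up for illustration. It relies on argmax returning the first occurrence of the maximum, which for a boolean row is the first True, and on any to tell "no match" apart from "match in the first column".

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1],
    'Low': [1.3, 1.2, 1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})


def first_true_position(mask):
    """Hypothetical helper: 1-based index of the first True in each row
    of a 2-D boolean array, NaN for rows that contain no True."""
    pos = mask.argmax(axis=1) + 1.0  # first maximum == first True
    return np.where(mask.any(axis=1), pos, np.nan)


# One-off preparation: pull the candidate columns out as 2-D arrays.
highs = df[['High1', 'High2', 'High3']].to_numpy()
lows = df[['Low1', 'Low2', 'Low3']].to_numpy()

# [:, None] turns the reference column into shape (n, 1) for broadcasting.
high_is = first_true_position(highs >= df['High'].to_numpy()[:, None])
low_is = first_true_position(lows <= df['Low'].to_numpy()[:, None])

# Comparisons involving NaN are False, so a row with a missing side
# ends up as 'None', the same as in the question's np.where version.
is_first = np.where(low_is < high_is, 'Low',
                    np.where(low_is > high_is, 'High', 'None'))

print(high_is)   # [ 3. nan  1.]
print(low_is)    # [1. 2. 3.]
print(is_first)  # ['Low' 'None' 'High']
```

Since highs and lows do not change between iterations, only the two comparisons and the argmax calls have to be repeated for each new High/Low pair.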

answered Sep 29 '22 07:09

Zeugma
