Fastest way to find which of two lists of columns of each row is true in a pandas dataframe

I'm looking for the fastest way to do the following:

We have a pd.DataFrame:

df = pd.DataFrame({
    'High': [1.3,1.2,1.1],
    'Low': [1.3,1.2,1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})

That looks like:

In [4]: df
Out[4]:
   High  High1  High2  High3  Low  Low1  Low2  Low3
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1
1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1

What I want to know is which of the High1, High2, High3 float values is the first that is greater than or equal to the High value. If there is none, the result should be np.nan.

The same goes for the Low1, Low2, Low3 values, but in this case I want the first one that is lower than or equal to the Low value. If there is none, the result should be np.nan.

In the end I need to know which one, Low or High, came first.

One way to solve this, in a weird and not very performant way, is:

df['LowIs'] = np.nan
df['HighIs'] = np.nan

for i in range(1,4):
    df['LowIs'] = np.where((np.isnan(df['LowIs'])) & (
        df['Low'] >= df['Low'+str(i)]), i, df['LowIs'])
    df['HighIs'] = np.where((np.isnan(df['HighIs'])) & (
        df['High'] <= df['High'+str(i)]), i, df['HighIs'])

df['IsFirst'] = np.where(
    df.LowIs < df.HighIs,
    'Low',
    np.where(df.LowIs > df.HighIs, 'High', 'None')
)

Which gives me:

In [8]: df
Out[8]:
   High  High1  High2  High3  Low  Low1  Low2  Low3  LowIs  HighIs IsFirst
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1    1.0     3.0     Low
1   1.2    1.1    1.2    1.3  1.2   1.3   1.2   1.1    2.0     2.0    None
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1    3.0     1.0    High

As I have to do this over and over in many iterations where High/Low will be different, performance is key.

So I wouldn't mind if High1, High2, High3 and Low1, Low2, Low3 were in a separate, transposed DataFrame, in a dict, or whatever. The process of preparing the data in whatever form gives the best possible performance can be slow and awkward.

One solution I worked on, but couldn't get to work in a vectorized way and that also seems quite slow, is:

df.loc[(df.index == 0), 'HighIs'] = np.where(
    df.loc[(df.index == 0), ['High1', 'High2', 'High3']] >= 1.3
)[1][0] + 1

This checks which of the columns the condition is true for in that first row, then takes the column index returned by np.where().

Looking forward to any suggestions and hope to learn something new! :)

Marco asked Nov 22 '16 16:11



2 Answers

If I understood the question right, this is a semi-vectorized version:

df = pd.DataFrame({
    'High': [1.3,1.7,1.1],
    'Low': [1.3,1.2,1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})

highs = ['High{:d}'.format(x) for x in range(1, 4)]  # High1, High2, High3

for h in highs[::-1]:
    mask = df['High'] <= df[h]
    df.loc[mask, 'FirstHigh'] = h

Produces:

   High  High1  High2  High3  Low  Low1  Low2  Low3 FirstHigh
0   1.3    1.1    1.2    1.3  1.3   1.3   1.2   1.1     High3
1   1.7    1.1    1.2    1.3  1.2   1.3   1.2   1.1       NaN
2   1.1    1.1    1.2    1.3  1.1   1.3   1.2   1.1     High1

Explanation: The key here is that we iterate over the columns in reverse. That is, we start at High3, check whether it is greater than or equal to High, and set FirstHigh accordingly. Then we move on to High2. If this one also qualifies, we simply overwrite the previous result; if not, the previous result stays as is. Since we iterate in reverse order, the first column to qualify ends up as the final result, and rows where no column qualifies keep NaN.
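For comparison, the same "first qualifying column" result can be sketched without the Python loop by building one boolean matrix and using argmax. This is not part of the answer above, just a minimal sketch under the assumption that the HighN columns are listed in the order they should be checked; the names hit, first and labels are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
})

high_cols = ['High1', 'High2', 'High3']  # order in which columns are checked

# Boolean matrix: True where HighN >= High for that row.
hit = df[high_cols].to_numpy() >= df['High'].to_numpy()[:, None]

# argmax gives the position of the first True in each row. It also returns 0
# for rows with no True at all, so those rows are masked out with any().
first = hit.argmax(axis=1)
labels = np.array(high_cols, dtype=object)[first]
df['FirstHigh'] = np.where(hit.any(axis=1), labels, np.nan)

print(df)
```

The reverse loop and this sketch should agree row by row; which one is faster will depend on the number of rows and columns, so it is worth timing both on real data.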

Aske Doerge answered Sep 29 '22 06:09


Test your High-n columns against the High column:

a = df[['High1', 'High2', 'High3']].ge(df.High, axis=0)  # select by name; positions depend on column order

a
Out[67]: 
   High1  High2  High3
0  False  False   True
1  False  False  False
2   True   True   True

Now replace False with np.nan and ask for the column index of the min or max (it doesn't matter which, since every value is either True or np.nan):

a.replace(False, np.nan).idxmax(1)

0    High3
1      NaN
2    High1

Same principle for the Low columns with le as comparison operator.
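Putting both directions together, here is a rough sketch that reproduces the LowIs, HighIs and IsFirst columns from the question. It swaps the replace/idxmax step for NumPy's argmax plus an any() mask, which is an assumption on my part rather than what the answer uses; first_hit is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def first_hit(mask):
    """Return the 1-based position of the first True per row, NaN if none."""
    pos = mask.argmax(axis=1) + 1.0          # argmax is 0 for all-False rows
    return np.where(mask.any(axis=1), pos, np.nan)

df = pd.DataFrame({
    'High': [1.3, 1.2, 1.1],
    'Low': [1.3, 1.2, 1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1],
})

high = df[['High1', 'High2', 'High3']].to_numpy()
low = df[['Low1', 'Low2', 'Low3']].to_numpy()

# Compare every HighN/LowN column against its reference column in one step.
df['HighIs'] = first_hit(high >= df['High'].to_numpy()[:, None])
df['LowIs'] = first_hit(low <= df['Low'].to_numpy()[:, None])

# Comparisons involving NaN are False, so such rows fall through to 'None',
# matching the nested np.where in the question.
df['IsFirst'] = np.select(
    [df['LowIs'] < df['HighIs'], df['LowIs'] > df['HighIs']],
    ['Low', 'High'],
    default='None',
)

print(df[['LowIs', 'HighIs', 'IsFirst']])
```

If High and Low change on every iteration while the HighN/LowN values stay fixed, the high and low arrays can be extracted once up front, so only the two comparisons are repeated.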

Zeugma answered Sep 29 '22 07:09