I'm looking for the fastest way to do the following:
We have a pd.DataFrame:
df = pd.DataFrame({
'High': [1.3,1.2,1.1],
'Low': [1.3,1.2,1.1],
'High1': [1.1, 1.1, 1.1],
'High2': [1.2, 1.2, 1.2],
'High3': [1.3, 1.3, 1.3],
'Low1': [1.3, 1.3, 1.3],
'Low2': [1.2, 1.2, 1.2],
'Low3': [1.1, 1.1, 1.1]})
That looks like:
In [4]: df
Out[4]:
High High1 High2 High3 Low Low1 Low2 Low3
0 1.3 1.1 1.2 1.3 1.3 1.3 1.2 1.1
1 1.2 1.1 1.2 1.3 1.2 1.3 1.2 1.1
2 1.1 1.1 1.2 1.3 1.1 1.3 1.2 1.1
What I want to know is which one of the High1, High2, High3 float values is the first that is greater than or equal to the High value. If there is none, the result should be np.nan.
And the same for the Low1, Low2, Low3 values, but in this case which one of them is the first that is less than or equal to the Low value. If there is none, the result should be np.nan.
In the end I need to know which one, Low or High, came first.
One way to solve this, in a weird and not very performant way, is:
df['LowIs'] = np.nan
df['HighIs'] = np.nan
for i in range(1, 4):
    df['LowIs'] = np.where((np.isnan(df['LowIs'])) & (
        df['Low'] >= df['Low'+str(i)]), i, df['LowIs'])
    df['HighIs'] = np.where((np.isnan(df['HighIs'])) & (
        df['High'] <= df['High'+str(i)]), i, df['HighIs'])
df['IsFirst'] = np.where(
df.LowIs < df.HighIs,
'Low',
np.where(df.LowIs > df.HighIs, 'High', 'None')
)
Which gives me:
In [8]: df
Out[8]:
High High1 High2 High3 Low Low1 Low2 Low3 LowIs HighIs IsFirst
0 1.3 1.1 1.2 1.3 1.3 1.3 1.2 1.1 1.0 3.0 Low
1 1.2 1.1 1.2 1.3 1.2 1.3 1.2 1.1 2.0 2.0 None
2 1.1 1.1 1.2 1.3 1.1 1.3 1.2 1.1 3.0 1.0 High
As I have to do this over and over again in many iterations where High/Low will be different, performance when doing this is key.
So I wouldn't mind if the High1, High2, High3 and Low1, Low2, Low3 values lived in a separate, transposed DataFrame, or in a dict, or whatever. Preparing the data can be slow and awkward, as long as the resulting layout gives the best possible performance.
One solution I worked on, but couldn't get to work in a vectorized way and which also seems quite slow, is:
df.loc[(df.index == 0), 'HighIs'] = np.where(
df.loc[(df.index == 0), ['High1', 'High2', 'High3']] >= 1.3
)[1][0] + 1
So checking for which one of the columns it is true in that first row and then looking at the index number of np.where().
Looking forward to any suggestions and hope to learn something new! :)
With NumPy's np.where(), we pass a condition that compares the columns. If 'column1' is less than both 'column2' and 'column3', the result takes the value of 'column1'; if the condition fails, the result is NaN. The results are stored in a new column of the DataFrame.
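A minimal sketch of that np.where() pattern, assuming a small made-up DataFrame; the column names follow the paragraph above, while the values and the 'new' column name are purely illustrative:

```python
import numpy as np
import pandas as pd

# Illustrative data; only the column names come from the text above.
df = pd.DataFrame({
    'column1': [1, 5, 2],
    'column2': [3, 4, 6],
    'column3': [2, 9, 1],
})

# True where column1 is smaller than both other columns.
cond = (df['column1'] < df['column2']) & (df['column1'] < df['column3'])

# Keep column1 where the condition holds, NaN elsewhere.
df['new'] = np.where(cond, df['column1'], np.nan)
```

Only the first row satisfies both comparisons here, so only that row keeps its column1 value.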
During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using the pandas DataFrame.diff() function.
To find the positions where two columns match, we first create a pandas DataFrame with two columns of city names. Then we use np.where() to compare the values of the two columns. This returns the indices at which the two columns have the same value.
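As a rough illustration of the matching-columns idea, assuming two made-up columns of city names (the column names 'city_a' and 'city_b' are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
cities = pd.DataFrame({
    'city_a': ['Paris', 'Rome', 'Oslo'],
    'city_b': ['Paris', 'Lima', 'Oslo'],
})

# np.where() with a single argument returns a tuple of index arrays;
# the first element holds the row positions where the condition is True.
same = (cities['city_a'] == cities['city_b']).to_numpy()
matching_positions = np.where(same)[0]
```

Note that the single-argument form of np.where() returns a tuple, which is why the sketch takes element [0].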
You can use the DataFrame.diff() function to find the difference between two rows in a pandas DataFrame. Its periods parameter sets how many rows back to look when calculating the difference.
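A short sketch of diff() and its periods parameter, assuming a made-up 'price' column:

```python
import pandas as pd

# Illustrative data only.
prices = pd.DataFrame({'price': [10, 13, 19, 18]})

# Difference to the previous row (periods defaults to 1).
step1 = prices['price'].diff()

# Difference to the row two positions earlier.
step2 = prices['price'].diff(periods=2)
```

The first periods rows have no earlier row to compare against, so they come out as NaN.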
If I understood the question right, this is a semi-vectorized version:
df = pd.DataFrame({
'High': [1.3,1.7,1.1],
'Low': [1.3,1.2,1.1],
'High1': [1.1, 1.1, 1.1],
'High2': [1.2, 1.2, 1.2],
'High3': [1.3, 1.3, 1.3],
'Low1': [1.3, 1.3, 1.3],
'Low2': [1.2, 1.2, 1.2],
'Low3': [1.1, 1.1, 1.1]})
# High1..High3 (starting the range at 0 would ask for a non-existent 'High0')
highs = ['High{:d}'.format(x) for x in range(1, 4)]
for h in highs[::-1]:
    mask = df['High'] <= df[h]
    df.loc[mask, 'FirstHigh'] = h
Produces:
High High1 High2 High3 Low Low1 Low2 Low3 FirstHigh
0 1.3 1.1 1.2 1.3 1.3 1.3 1.2 1.1 High3
1 1.7 1.1 1.2 1.3 1.2 1.3 1.2 1.1 NaN
2 1.1 1.1 1.2 1.3 1.1 1.3 1.2 1.1 High1
Explanation:
The key here is that we iterate over the columns in reverse. That is, we start at High3, check whether it is greater than or equal to High, and set FirstHigh accordingly. Then we move on to High2. If this also satisfies the condition, we simply overwrite the previous result; if not, the result stays as is. Since we iterate in reverse order, the first column to reach High ends up as the final result.
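The same reverse-overwrite idea can be wrapped in a small helper so that it also covers the Low columns. This is only a sketch: the function name first_match_reverse and the FirstLow result are assumptions for illustration, not part of the answer above.

```python
import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.7, 1.1],
    'Low': [1.3, 1.2, 1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})


def first_match_reverse(frame, ref, candidates, compare):
    """Name of the first candidate column satisfying compare(candidate, ref).

    Hypothetical helper: walks the candidates back to front so that
    earlier columns overwrite later ones; rows with no match stay NaN.
    """
    result = pd.Series(np.nan, index=frame.index, dtype=object)
    for col in reversed(candidates):
        mask = compare(frame[col], frame[ref])
        result.loc[mask] = col
    return result


first_high = first_match_reverse(df, 'High', ['High1', 'High2', 'High3'], operator.ge)
first_low = first_match_reverse(df, 'Low', ['Low1', 'Low2', 'Low3'], operator.le)
```

The loop still runs once per candidate column, so this stays semi-vectorized: the cost grows with the number of columns, not the number of rows.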
Test your High-n columns against the High column:
a = df[['High1', 'High2', 'High3']].ge(df.High, axis=0)
a
Out[67]:
High1 High2 High3
0 False False True
1 False False False
2 True True True
Now replace False with np.nan and ask for the column index of the min or max (it doesn't matter which, as every value is either True or np.nan):
a.replace(False, np.nan).idxmax(1)
0 High3
1 NaN
2 High1
The same principle applies to the Low columns, with le as the comparison operator.
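Since the question ultimately needs both positions plus which side came first, here is a minimal NumPy sketch that combines the comparison idea above for High and Low. It assumes the original question's data, and first_hit is a hypothetical helper name. It relies on argmax returning the position of the first True in each row of a boolean array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'High': [1.3, 1.2, 1.1],
    'Low': [1.3, 1.2, 1.1],
    'High1': [1.1, 1.1, 1.1],
    'High2': [1.2, 1.2, 1.2],
    'High3': [1.3, 1.3, 1.3],
    'Low1': [1.3, 1.3, 1.3],
    'Low2': [1.2, 1.2, 1.2],
    'Low3': [1.1, 1.1, 1.1]})


def first_hit(mask):
    """1-based position of the first True per row, NaN for rows without any."""
    pos = mask.argmax(axis=1) + 1.0  # argmax picks the first True in each row
    return np.where(mask.any(axis=1), pos, np.nan)


high_cols = ['High1', 'High2', 'High3']
low_cols = ['Low1', 'Low2', 'Low3']

# Broadcast the reference column against the candidate columns.
high_mask = df[high_cols].to_numpy() >= df['High'].to_numpy()[:, None]
low_mask = df[low_cols].to_numpy() <= df['Low'].to_numpy()[:, None]

df['HighIs'] = first_hit(high_mask)
df['LowIs'] = first_hit(low_mask)

# Comparisons against NaN are False, so rows with a missing side fall to 'None'.
df['IsFirst'] = np.select(
    [df['LowIs'] < df['HighIs'], df['LowIs'] > df['HighIs']],
    ['Low', 'High'],
    default='None')
```

If the candidate columns never change between iterations, the two to_numpy() conversions of high_cols and low_cols could be done once up front, leaving only the broadcasted comparison and argmax inside the repeated step.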