I'm filtering over a DataFrame to get the area under a curve. I've managed to get the border of the curve, such that we only want rows under that curve.
The way I've gone about this is by getting data_y_border (red curve in diagram) with [1] in the code below (this works fine). For each X, it contains the lowest Y at which another column (Density) is >= 0.7, which is the topmost Y of the area I want to keep, so I can query data_y_border[x_value] and get the corresponding topmost Y.
Note: data_y_border is not the lowest values of Y in the entire dataset. data (blue rectangle in diagram) is our dataset, and data_y_border is the lower boundary of the red area defined by the Density column where its values are above 0.7:
density_zone = data[
(full_dataset["X"] < x_right_boundary)
& (full_dataset['Density'] >= 0.7)
& (full_dataset['Y'] > y_lower_boundary)
]
data_y_border is the bottom of the red region. Anything under it doesn't have a Density > 0.7.

I now want to use the topmost Y for each X position (in data_y_border) to keep all rows whose Y is <= the topmost Y for that row's X.
I'm trying a combination of loc and lambda in [2] below to compare the row value to the topmost Y per row but I'm getting errors saying:
ValueError: Can only compare identically-labeled Series objects
Code:
[1] data_y_border = density_zone.groupby("X")["Y"].min() #returns Series
or
data_y_border = density_zone.loc[density_zone.groupby("X")["Y"].idxmin()] # returns DataFrame
# as per @enke's suggestion
[2] data.loc[lambda row: row['Y'] <= data_y_border.get(row['X'])]
# get the X value for `row`,
# use it as the index in `data_y_border` to get the corresponding Y value,
# compare that row's Y value to see if it's less than or equal to the topmost Y.
# If it is, keep it
The DataFrame has about 23 columns in it, but as an example, given the following data DataFrame and data_y_border, I would expect to keep the rows shown in the expected output below:
data =
X Y OtherDataIWantToKeep
2.0 307.0 ...
2.0 155.3 ...
2.0 120.0 ...
2.0 80.2 ...
4.0 500.3 ...
4.0 270.8 ...
4.0 111.2 ...
4.0 78.23 ...
4.0 6.3 ...
data_y_border=
2.0, 155.3
4.0, 111.2
expected output rows (including all data from other columns):
X Y OtherDataIWantToKeep
2.0 155.3 ...
2.0 120.0 ...
2.0 80.2 ...
4.0 111.2 ...
4.0 78.23 ...
4.0 6.3 ...
I've tried combinations involving .apply instead, but I get key errors with that approach. I get the feeling the issue is with the data_y_border.get(row['X']) part of the code above, where Pandas doesn't like running a query on a separate filter in order to use that value to filter the current DataFrame.
Is using loc and lambda not the right way to filter over every row in a DataFrame, comparing each row's value to a value mapped out in another DataFrame/Series?
I've considered iterrows for this (if it were Arrays/Lists in Python/JS I would have mapped over them), but that feels too expensive for a pretty sizeable DataFrame.
From your comment:
The curve is based on values from another column. It's basically rows where values for another column are greater than a certain value, find the lowest Y for each X. That becomes our curve boundary. Using that curve we want to find the rows in the area beneath the curve.
it seems data_y_border is calculated independently from data, so let's take it as given in the question (here in its DataFrame form, with X and Y columns). We can then map it onto data['X'], compare the result with data['Y'], and filter:
out = data[data['Y'] <= data['X'].map(data_y_border.set_index('X')['Y'])]
Output:
X Y OtherDataIWantToKeep
1 2.0 155.30 ...
2 2.0 120.00 ...
3 2.0 80.20 ...
6 4.0 111.20 ...
7 4.0 78.23 ...
8 4.0 6.30 ...
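To make this reproducible, here is a minimal sketch that rebuilds the example from the question and applies the same map-then-compare idea. The names border_series, border_frame and limit are made up for illustration, and the other 20-odd columns are left out. It also shows what probably went wrong with the original attempt: a callable passed to loc receives the whole DataFrame once, not one row at a time, so the lookup inside it has to be vectorized too.

```python
import pandas as pd

# Example data from the question (only X and Y; other columns omitted)
data = pd.DataFrame({
    "X": [2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0],
    "Y": [307.0, 155.3, 120.0, 80.2, 500.3, 270.8, 111.2, 78.23, 6.3],
})

# Border in Series form (index = X, values = Y), the shape that
# groupby("X")["Y"].min() returns. Values are typed in by hand here.
border_series = pd.Series(
    [155.3, 111.2], index=pd.Index([2.0, 4.0], name="X"), name="Y"
)

# Border in DataFrame form (X and Y as columns), the shape assumed
# by the set_index('X')['Y'] call in the answer.
border_frame = pd.DataFrame({"X": [2.0, 4.0], "Y": [155.3, 111.2]})

# Look up the border Y for every row's X. The result is aligned with
# data's own index, so the comparison below is label-compatible.
limit = data["X"].map(border_series)
out = data[data["Y"] <= limit]

# Same filter starting from the DataFrame form of the border
out_from_frame = data[
    data["Y"] <= data["X"].map(border_frame.set_index("X")["Y"])
]

# loc with a callable also works, as long as the callable treats its
# argument as the whole DataFrame and returns a boolean mask
out_loc = data.loc[lambda df: df["Y"] <= df["X"].map(border_series)]

print(out)
```

Rows whose X has no entry in the border get NaN from map, the comparison is False for them, and they are dropped. If you would rather keep them, fill limit first (for example with float("inf")) before comparing.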