Filtering each X in DataFrame with values from other Series/DataFrame (area under curve)

Question

I'm filtering over a DataFrame to get the area under a curve. I've managed to get the border of the curve, such that we only want rows under that curve.

The way I've gone about this is by getting data_y_border (the red curve in the diagram) with [1] in the code below (this works fine). It contains the topmost Y for each X where the values of another column are >= 0.7, so I can query data_y_border[x_value] and get the corresponding topmost Y.

Note: data_y_border is not the lowest values of Y in the entire dataset. data (blue rectangle in diagram) is our dataset, and data_y_border is the lower boundary of the red area defined by the Density column where its values are above 0.7:

    density_zone = data[
        (full_dataset["X"] < x_right_boundary)
        & (full_dataset['Density'] >= 0.7)
        & (full_dataset['Y'] > y_lower_boundary)
    ]

data_y_border is the bottom of the red region. Anything under it doesn't have a Density >= 0.7.

Area under a curve

I now want to use the Y value for each X position to keep all rows where that X value corresponds to a Y <= its topmost Y (in data_y_border).

I'm trying a combination of loc and lambda in [2] below to compare the row value to the topmost Y per row but I'm getting errors saying:

ValueError: Can only compare identically-labeled Series objects

Code:

[1] data_y_border = density_zone.groupby("X")["Y"].min()  # returns Series

                          or

    data_y_border = density_zone.loc[density_zone.groupby("X")["Y"].idxmin()]  # returns DataFrame
    # as per @enke's suggestion

[2] data.loc[lambda row: row['Y'] <= data_y_border.get(row['X'])]

    # get the X value for `row`,
    # use it as the index in `data_y_border` to get the corresponding Y value,
    # compare that row's Y value to see if it's less than or equal to the topmost Y.
    # If it is, keep it

The DataFrame has about 23 columns in it, but as an example, given the following data DataFrame and data_y_border, I would expect to keep the rows shown in the expected output below:

data = 
X    Y        OtherDataIWantToKeep
2.0  307.0    ...
2.0  155.3    ...     
2.0  120.0    ...     
2.0  80.2     ...        
4.0  500.3    ...
4.0  270.8    ...
4.0  111.2    ...
4.0  78.23    ...
4.0  6.3      ...

data_y_border=
2.0, 155.3
4.0, 111.2

expected output rows (including all data from other columns):

X    Y        OtherDataIWantToKeep
2.0  155.3    ...     
2.0  120.0    ...     
2.0  80.2     ...        
4.0  111.2    ...
4.0  78.23    ...
4.0  6.3      ...

I've tried combinations involving .apply instead, but I get key errors with that approach. I get the feeling the issue is with the data_y_border.get(row['X']) part of the code above, where Pandas doesn't like running a query on a separate filter in order to use that value to filter the current DataFrame.

Is using loc and lambda not the right way to filter over every row in a DataFrame, comparing each row's value to a value looked up in another DataFrame/Series?

I've considered iterrows for this (if it were Arrays/Lists in Python/JS I would have mapped over them), but that feels too expensive for a pretty sizeable DataFrame.

Admin · Accepted Answer

From your comment:

The curve is based on values from another column. It's basically rows where values for another column are greater than a certain value, find the lowest Y for each X. That becomes our curve boundary. Using that curve we want to find the rows in the area beneath the curve.

it seems data_y_border is calculated independently from data, so let's take it as given in the question. Assuming it is the DataFrame variant with X and Y columns, we can map it onto data['X'], compare the result with data['Y'], and filter:

out = data[data['Y'] <= data['X'].map(data_y_border.set_index('X')['Y'])]

Output:

     X       Y OtherDataIWantToKeep
1  2.0  155.30                  ...
2  2.0  120.00                  ...
3  2.0   80.20                  ...
6  4.0  111.20                  ...
7  4.0   78.23                  ...
8  4.0    6.30                  ...
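
For anyone who wants to run this end to end, here is a minimal sketch using the sample values from the question. The names border_series, border_frame, limit, out_from_series and out_from_frame are made up for illustration, and the extra columns are left out. It covers both forms of data_y_border from [1]: the Series returned by groupby(...).min() can be passed to map directly, while the DataFrame form needs set_index('X')['Y'] first.

```python
import pandas as pd

# Sample rows from the question (only X and Y; the other columns are omitted).
data = pd.DataFrame({
    "X": [2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0],
    "Y": [307.0, 155.3, 120.0, 80.2, 500.3, 270.8, 111.2, 78.23, 6.3],
})

# Border as a Series indexed by X, the shape groupby("X")["Y"].min() returns.
border_series = pd.Series(
    [155.3, 111.2], index=pd.Index([2.0, 4.0], name="X"), name="Y"
)

# Border as a DataFrame with X and Y columns, as in the idxmin variant.
border_frame = border_series.reset_index()

# Look up the border Y for every row's X, aligned to data's own index.
limit = data["X"].map(border_series)

# Keep rows whose Y is at or below the border for their X.
out_from_series = data[data["Y"] <= limit]

# Same filter starting from the DataFrame form of the border.
out_from_frame = data[data["Y"] <= data["X"].map(border_frame.set_index("X")["Y"])]

print(out_from_series)
```

Because map returns a Series with the same index as data, the comparison is between identically labelled Series, which avoids the ValueError from the question. Rows whose X has no entry in the border get NaN from map, the comparison evaluates to False for them, and they are dropped.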


Tags:

python

pandas

dataframe

lambda

pandas-groupby

GroomedGorilla
