
Filtering DataFrame based on specific conditions in Python

I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on specific conditions.


The condition is that if the DESCRIPTION column contains the words "kids" or "baby", I want to include all the values from that INVOICE_ID in the filtered DataFrame. In other words, at least one item in the transaction should belong to the kids or baby category for the entire transaction to be included.

I tried using the str.contains() method in combination with a regular expression pattern, but I'm having trouble getting the desired results.

Here's my code:

import pandas as pd

# Assuming the DataFrame is named 'df'

# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]

# Print the filtered DataFrame
filtered_df

However, this code does not give the expected result: it keeps only the individual rows that match, rather than every row of a transaction that contains a match.

Here is the code that generates the test data:

import pandas as pd
import random
import string
import numpy as np

random.seed(42)
np.random.seed(42)

num_transactions = 100
max_items_per_transaction = 6

# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]

# Create the DataFrame
rows = []

for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i

    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)

        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)

df = pd.DataFrame(rows)

# Print the DataFrame
df

Can anyone please guide me on how to properly filter the DataFrame based on the described condition? I would greatly appreciate any help or suggestions. Thank you!

asked Jun 14 '26 04:06 by sdave


1 Answer

Suppose the following dataframe:

>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456

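You want to keep INVOICE_ID=123 because 'kids' is in the description of row 0. Build a row-level boolean mask, then use groupby with transform('max') to spread it across each invoice, so that every row of an invoice is True if at least one of its rows matches: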

m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]

Output:

>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
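
For reference, here is a minimal self-contained sketch of the same idea on data shaped like the question's. The sample rows are hypothetical, chosen only to mirror the INVOICE_ID and DESCRIPTION columns. It also shows an alternative that is not part of the answer above: collect the matching invoice IDs first, then select with isin. Using transform('any') here should be equivalent to transform('max') on a boolean mask.

```python
import pandas as pd

# Hypothetical sample rows mirroring the question's columns.
df = pd.DataFrame({
    "INVOICE_ID": [1001, 1001, 1002, 1002, 1003],
    "DESCRIPTION": ["Kids T-shirt", "Men's Shirt", "Women's Dress",
                    "Men's Hat", "Baby Bib"],
})

# Row-level mask: True where the description mentions kids or baby.
m = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True)

# Option 1: broadcast the mask to every row of the same invoice.
by_transform = df[m.groupby(df["INVOICE_ID"]).transform("any")]

# Option 2: collect the matching invoice IDs, then keep all their rows.
matching_ids = df.loc[m, "INVOICE_ID"].unique()
by_isin = df[df["INVOICE_ID"].isin(matching_ids)]

print(by_transform)
```

Both options keep every row of invoices 1001 and 1003 and drop invoice 1002, which has no kids or baby item. The isin variant may read more naturally if you also need the list of matching invoice IDs elsewhere.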

answered Jun 16 '26 20:06 by Corralien


