I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on specific conditions.

The condition is that if the DESCRIPTION column contains the words "kids" or "baby", I want to include all the values from that INVOICE_ID in the filtered DataFrame. In other words, at least one item in the transaction should belong to the kids or baby category for the entire transaction to be included.
I tried using the str.contains() method in combination with a regular expression pattern, but I'm having trouble getting the desired results.
Here's my code:
import pandas as pd
# Assuming the DataFrame is named 'df'
# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]
# Print the filtered DataFrame
filtered_df
However, this code does not provide the expected results. It filters the data frame based on individual rows rather than considering the entire transaction.
Here is the test data:
import pandas as pd
import random
import string
import numpy as np
random.seed(42)
np.random.seed(42)
num_transactions = 100
max_items_per_transaction = 6
# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]
# Create the DataFrame
rows = []
for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i
    for j in range(num_items):
        item = items[j]
        # Generate a random 6-character USIM value
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))
        demand_qty = random.randint(1, 10)
        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)
df = pd.DataFrame(rows)
# Print the DataFrame
df
Can anyone please guide me on how to properly filter the DataFrame based on the described condition? I would greatly appreciate any help or suggestions. Thank you!
Suppose the following dataframe:
>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456
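You want to keep INVOICE_ID=123 because 'kids' is in the description of row 0. Build a row-level mask first, then group it by INVOICE_ID and use transform('max') so that every row of an invoice is marked True when at least one of its rows matches: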
m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]
Output:
>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
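For comparison, here is a minimal sketch of the same idea next to an alternative that collects the matching invoice IDs and selects them with isin. The toy rows are made up for illustration (they only borrow the question's column names), and na=False is an added assumption so that missing descriptions count as non-matches rather than producing NaN in the mask.

```python
import pandas as pd

# Hypothetical toy data using the question's column names
df = pd.DataFrame({
    "INVOICE_ID": [123, 123, 123, 456, 456, 789],
    "DESCRIPTION": ["Kids T-shirt", "Men's Hat", "Women's Skirt",
                    "Men's Pants", "Women's Shoes", "Baby Bib"],
})

# Row-level mask: True where the description mentions kids or baby.
# na=False (an assumption) treats missing descriptions as non-matches.
m = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True, na=False)

# Option 1: broadcast "at least one match in this invoice" to every row
by_transform = df[m.groupby(df["INVOICE_ID"]).transform("max")]

# Option 2: collect the matching invoice IDs, then keep all their rows
matching_ids = df.loc[m, "INVOICE_ID"].unique()
by_isin = df[df["INVOICE_ID"].isin(matching_ids)]

print(by_isin)
```

Both options should keep every row of invoices 123 and 789 and drop invoice 456. The isin version may be easier to read, and the list of matching IDs is available for reuse elsewhere.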