Filtering DataFrame based on specific conditions in Python

Question

I have a DataFrame with the following columns: INVOICE_DATE, COUNTRY, CUSTOMER_ID, INVOICE_ID, DESCRIPTION, USIM, and DEMANDQTY. I want to filter the DataFrame based on specific conditions.

The condition is that if the DESCRIPTION column contains the words "kids" or "baby", I want to include all the values from that INVOICE_ID in the filtered DataFrame. In other words, at least one item in the transaction should belong to the kids or baby category for the entire transaction to be included.

I tried using the str.contains() method in combination with a regular expression pattern, but I'm having trouble getting the desired results.

Here's my code:

import pandas as pd

# Assuming the DataFrame is named 'df'

# Filter the DataFrame based on the condition
filtered_df = df[df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)]

# Print the filtered DataFrame
filtered_df

However, this code does not give the expected result. It keeps only the individual rows whose DESCRIPTION matches, rather than every row of any transaction that contains a match.
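To make the problem concrete, here is a minimal sketch on a hypothetical two-invoice frame (the name `demo` and its values are made up for illustration and are not the real data). It shows the row-wise filter dropping a non-matching item that belongs to a matching invoice:

```python
import pandas as pd

# Hypothetical frame for illustration only: invoice 1001 has one kids item
# and one non-kids item, invoice 1002 has no kids/baby item at all.
demo = pd.DataFrame({
    "INVOICE_ID": [1001, 1001, 1002],
    "DESCRIPTION": ["Kids Socks", "Men's Hat", "Women's Dress"],
})

# Row-level test: True only where the description itself matches
row_mask = demo["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True)

# Boolean indexing keeps matching rows only, so "Men's Hat" is lost
# even though it is part of invoice 1001.
row_filtered = demo[row_mask]
print(row_filtered)
```

What I want instead is both rows of invoice 1001 and nothing from invoice 1002.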

Here is some test data:

import pandas as pd
import random
import string
import numpy as np

random.seed(42)
np.random.seed(42)

num_transactions = 100
max_items_per_transaction = 6

# Generate a list of possible items
possible_items = [
    "Kids T-shirt", "Baby Onesie", "Kids Socks",
    "Men's Shirt", "Women's Dress", "Kids Pants",
    "Baby Hat", "Women's Shoes", "Men's Pants",
    "Kids Jacket", "Baby Bib", "Men's Hat",
    "Women's Skirt", "Kids Shoes", "Baby Romper",
    "Men's Sweater", "Kids Gloves", "Baby Blanket"
]

# Create the DataFrame
rows = []

for i in range(num_transactions):
    num_items = random.randint(1, max_items_per_transaction)
    items = random.sample(possible_items, num_items)
    invoice_dates = pd.date_range(start='2022-01-01', periods=num_items, freq='D')
    countries = random.choices(['USA', 'Canada', 'UK'], k=num_items)
    customer_id = i + 1
    invoice_id = 1001 + i

    for j in range(num_items):
        item = items[j]
        usim = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))  # Generate a random 6-character USIM value
        demand_qty = random.randint(1, 10)

        row = {
            'INVOICE_DATE': invoice_dates[j],
            'COUNTRY': countries[j],
            'CUSTOMER_ID': customer_id,
            'INVOICE_ID': invoice_id,
            'DESCRIPTION': item,
            'USIM': usim,
            'DEMANDQTY': demand_qty
        }
        rows.append(row)

df = pd.DataFrame(rows)

# Print the DataFrame
df

Can anyone please guide me on how to properly filter the DataFrame based on the described condition? I would greatly appreciate any help or suggestions. Thank you!

Corralien · Accepted Answer

Suppose the following dataframe:

>>> df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
3     another         456
4         one         456

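You want to keep every row with INVOICE_ID=123 because 'kids' appears in the description of row 0. Build a row-level boolean mask, then broadcast it across each invoice with groupby(...).transform('max'): the maximum of a boolean group is True as soon as one row in the group is True, so every row of a matching invoice is selected: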

m = df['DESCRIPTION'].str.contains('kids|baby', case=False, regex=True)
filtered_df = df[m.groupby(df['INVOICE_ID']).transform('max')]

Output:

>>> filtered_df
  DESCRIPTION  INVOICE_ID
0        kids         123
1       hello         123
2       world         123
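For readers who want to see the intermediate step, the following is a minimal sketch on the same small frame. It spells the group-level test as transform("any"), which on a boolean mask expresses the same idea as 'max', and compares it with an alternative that collects the matching invoice IDs and selects them with isin. The variable names (invoice_mask, matching_ids, and so on) are illustrative choices, not part of the original answer:

```python
import pandas as pd

# Same small example frame as above
df = pd.DataFrame({
    "DESCRIPTION": ["kids", "hello", "world", "another", "one"],
    "INVOICE_ID": [123, 123, 123, 456, 456],
})

# Row-level mask: True only where the description itself matches
m = df["DESCRIPTION"].str.contains("kids|baby", case=False, regex=True)

# Group-level mask: True for every row of an invoice that has at least
# one matching row. transform returns a Series aligned with df's index.
invoice_mask = m.groupby(df["INVOICE_ID"]).transform("any")
filtered_transform = df[invoice_mask]

# Alternative: find the invoice IDs that match, then keep all their rows
matching_ids = df.loc[m, "INVOICE_ID"].unique()
filtered_isin = df[df["INVOICE_ID"].isin(matching_ids)]

print(invoice_mask.tolist())
print(filtered_transform.equals(filtered_isin))
```

Both selections keep the three rows of invoice 123. The isin version can be easier to read, while the transform version keeps everything in a single mask that can be combined with other conditions.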

Tags:

python

python-3.x

pandas

dataframe

sdave
