I have a DataFrame with sales information from a supermarket. Each row represents one item sold, with several characteristics as columns. The original DataFrame looks like this:
In [1]: import pandas as pd
my_data = [{'TICKET_NUMBER': '001', 'TICKET_ROW': '1', 'ITEM': 'vegetable'},
           {'TICKET_NUMBER': '001', 'TICKET_ROW': '2', 'ITEM': 'vegetable'},
           {'TICKET_NUMBER': '001', 'TICKET_ROW': '3', 'ITEM': 'soup'},
           {'TICKET_NUMBER': '002', 'TICKET_ROW': '1', 'ITEM': 'soup'},
           {'TICKET_NUMBER': '002', 'TICKET_ROW': '2', 'ITEM': 'drink'},
           {'TICKET_NUMBER': '003', 'TICKET_ROW': '1', 'ITEM': 'meat'},
           {'TICKET_NUMBER': '003', 'TICKET_ROW': '2', 'ITEM': 'vegetable'},
           {'TICKET_NUMBER': '003', 'TICKET_ROW': '3', 'ITEM': 'meat'}]
df = pd.DataFrame(my_data)
In [2]: df
Out[2]:
   TICKET_NUMBER  TICKET_ROW       ITEM
0            001           1  vegetable
1            001           2  vegetable
2            001           3       soup
3            002           1       soup
4            002           2      drink
5            003           1       meat
6            003           2  vegetable
7            003           3       meat
I want to filter out duplicated items that belong to the same ticket. For example, in the first ticket (TICKET_NUMBER==001), there are 2 vegetables, so I want to delete 1 of them. The same happens in ticket number 003 with meat.
So, the final dataset would look like this:
   TICKET_NUMBER  TICKET_ROW       ITEM
0            001           1  vegetable
1            001           3       soup
2            002           1       soup
3            002           2      drink
4            003           1       meat
5            003           2  vegetable
My guess was to group by TICKET_NUMBER and then reduce ITEM to its unique values, i.e. df.groupby('TICKET_NUMBER')['ITEM'].unique(). Once I have the unique values, I would like to reverse those groups (a kind of "ungroupby") back into a DataFrame. Is that possible?
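To make that concrete, this is roughly where I got to (with the same df as above; unique_items is just my name for the intermediate result):

unique_items = df.groupby('TICKET_NUMBER')['ITEM'].unique()
# unique_items is now a Series indexed by TICKET_NUMBER, holding arrays like
# ['vegetable', 'soup'] for ticket 001 -- the TICKET_ROW information is gone

But I don't know how to turn that back into a row-per-item DataFrame that still keeps TICKET_ROW.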
I'm sure there are other ways of doing what I'm looking for. Please help!
Thank you!
I think you're close. It looks like taking the first TICKET_ROW in the case of duplicates would suffice, and we can use as_index=False so that TICKET_NUMBER and ITEM stay regular columns (rather than becoming a MultiIndex) and the result still looks like the original DataFrame. So we can group by TICKET_NUMBER and ITEM and take the first TICKET_ROW:
df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
which gives:
   TICKET_NUMBER       ITEM  TICKET_ROW
0            001  vegetable           1
1            001       soup           3
2            002       soup           1
3            002      drink           2
4            003       meat           1
5            003  vegetable           2
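If you would rather keep the original column order and row order exactly, an alternative sketch of the same idea is to drop later duplicates of each (TICKET_NUMBER, ITEM) pair with drop_duplicates:

# keep only the first row for each (TICKET_NUMBER, ITEM) pair, preserving
# the original column order and the original row order
df.drop_duplicates(subset=["TICKET_NUMBER", "ITEM"], keep="first").reset_index(drop=True)

That yields the same six rows, just with the columns still in the original TICKET_NUMBER, TICKET_ROW, ITEM order.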