Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: filter unique values in groups

Tags:

python

pandas

I have a dataframe with sales information in a supermarket. Each row in the dataframe represents an item, with several characteristics as columns. The original DataFrame is something like this:

In [1]: import pandas as pd
        my_data = [{'ticket_number' : '001', 'ITEM' : 'vegetable', 'ticket_line' : '1'},
               {'TICKET_NUMBER' : '001', 'ITEM' : 'vegetable', 'TICKET_ROW' : '2'},
               {'TICKET_NUMBER' : '001', 'ITEM' : 'soup', 'TICKET_ROW' : '3'},
               {'TICKET_NUMBER' : '002', 'ITEM' : 'soup', 'TICKET_ROW' : '1'},
               {'TICKET_NUMBER' : '002', 'ITEM' : 'drink', 'TICKET_ROW' : '2'},
               {'TICKET_NUMBER' : '003', 'ITEM' : 'meat', 'TICKET_ROW' : '1'},
               {'TICKET_NUMBER' : '003', 'ITEM' : 'vegetable', 'TICKET_ROW' : '2'},
               {'TICKET_NUMBER' : '003', 'ITEM' : 'meat', 'TICKET_ROW' : '3'}]
        df = pd.DataFrame(my_data)

In [2]: df
Out [2]:    
            TICKET_NUMBER   TICKET_ROW        ITEM
         0        001            1           vegetable
         1        001            2           vegetable
         2        001            3           soup
         3        002            1           soup
         4        002            2           drink
         5        003            1           meat
         6        003            2           vegetable
         7        003            3           meat

I want to filter out duplicated items that belong to the same ticket. For example, in the first ticket (TICKET_NUMBER==001), there are 2 vegetables, so I want to delete 1 of them. The same happens in ticket number 003 with meat.

So, the final dataset would look like this:

        TICKET_NUMBER   TICKET_ROW        ITEM
     0        001            1           vegetable
     1        001            3           soup
     2        002            1           soup
     3        002            2           drink
     4        003            1           meat
     5        003            2           vegetable

My guess was to groupby TICKET_NUMBER, then filter ITEM by unique(), (df.groupby(['TICKET_NUMBER','TICKET_ROW'])['ITEM'].unique()). Once I have the unique values, I would like to reverse those groups (kind of "ungroupby") to a DataFrame. Is that possible?

I'm sure there are other ways of doing what I'm looking for. Please, help!

Thank you!

like image 871
Andres Avatar asked Oct 08 '15 15:10

Andres


People also ask

How do you get unique values on Groupby pandas?

Method 1: Count unique values using nunique() The Pandas dataframe. nunique() function returns a series with the specified axis's total number of unique observations. The total number of distinct observations over the index axis is discovered if we set the value of the axis to 0.

How do I filter unique values in pandas?

You can get unique values in column (multiple columns) from pandas DataFrame using unique() or Series. unique() functions. unique() from Series is used to get unique values from a single column and the other one is used to get from multiple columns.

What is Nunique () in pandas?

The nunique() method returns the number of unique values for each column. By specifying the column axis ( axis='columns' ), the nunique() method searches column-wise and returns the number of unique values for each row.

How do I get the number of unique values in a column in pandas?

In order to get the count of unique values on multiple columns use pandas DataFrame. drop_duplicates() which drop duplicate rows from pandas DataFrame. This eliminates duplicates and return DataFrame with unique rows.


1 Answers

I think you're close. It looks like taking the first TICKET_ROW in the case of duplicates would suffice, and we can use as_index=False to keep things looking like the original dataframe. So we can group by TICKET_NUMBER and ITEM and take the first TICKET_ROW:

df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()

which gives

In [46]: df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
Out[46]: 
  TICKET_NUMBER       ITEM TICKET_ROW
0           001  vegetable          1
1           001       soup          3
2           002       soup          1
3           002      drink          2
4           003       meat          1
5           003  vegetable          2
like image 77
DSM Avatar answered Oct 23 '22 01:10

DSM