I have a DataFrame with sales information from a supermarket. Each row represents one item sold, with several characteristics as columns. The original DataFrame looks like this:
In [1]: import pandas as pd
my_data = [{'TICKET_NUMBER': '001', 'TICKET_ROW': '1', 'ITEM': 'vegetable'},
           {'TICKET_NUMBER': '001', 'TICKET_ROW': '2', 'ITEM': 'vegetable'},
           {'TICKET_NUMBER': '001', 'TICKET_ROW': '3', 'ITEM': 'soup'},
           {'TICKET_NUMBER': '002', 'TICKET_ROW': '1', 'ITEM': 'soup'},
           {'TICKET_NUMBER': '002', 'TICKET_ROW': '2', 'ITEM': 'drink'},
           {'TICKET_NUMBER': '003', 'TICKET_ROW': '1', 'ITEM': 'meat'},
           {'TICKET_NUMBER': '003', 'TICKET_ROW': '2', 'ITEM': 'vegetable'},
           {'TICKET_NUMBER': '003', 'TICKET_ROW': '3', 'ITEM': 'meat'}]
df = pd.DataFrame(my_data)
In [2]: df
Out[2]:
   TICKET_NUMBER  TICKET_ROW       ITEM
0            001           1  vegetable
1            001           2  vegetable
2            001           3       soup
3            002           1       soup
4            002           2      drink
5            003           1       meat
6            003           2  vegetable
7            003           3       meat
I want to filter out duplicated items that belong to the same ticket. For example, in the first ticket (TICKET_NUMBER==001), there are 2 vegetables, so I want to delete 1 of them. The same happens in ticket number 003 with meat.
So, the final dataset would look like this:
   TICKET_NUMBER  TICKET_ROW       ITEM
0            001           1  vegetable
1            001           3       soup
2            002           1       soup
3            002           2      drink
4            003           1       meat
5            003           2  vegetable
My guess was to group by TICKET_NUMBER and then reduce ITEM to its unique values, i.e. df.groupby('TICKET_NUMBER')['ITEM'].unique(). Once I have the unique values, I would like to reverse those groups (a kind of "ungroupby") back into a DataFrame. Is that possible?
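To make that concrete, this is roughly where I got to (with the same df as above; unique_items is just my name for the intermediate result):

unique_items = df.groupby('TICKET_NUMBER')['ITEM'].unique()
# unique_items is now a Series indexed by TICKET_NUMBER, holding arrays like
# ['vegetable', 'soup'] for ticket 001 -- the TICKET_ROW information is gone

But I don't know how to turn that back into a row-per-item DataFrame that still keeps TICKET_ROW.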
I'm sure there are other ways of doing what I'm looking for. Please help!
Thank you!
I think you're close. It looks like taking the first TICKET_ROW in the case of duplicates would suffice, and we can use as_index=False so that TICKET_NUMBER and ITEM stay regular columns (rather than becoming a MultiIndex) and the result still looks like the original DataFrame. So we can group by TICKET_NUMBER and ITEM and take the first TICKET_ROW:
df.groupby(["TICKET_NUMBER", "ITEM"], sort=False, as_index=False)["TICKET_ROW"].first()
which gives:
   TICKET_NUMBER       ITEM  TICKET_ROW
0            001  vegetable           1
1            001       soup           3
2            002       soup           1
3            002      drink           2
4            003       meat           1
5            003  vegetable           2
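If you would rather keep the original column order and row order exactly, an alternative sketch of the same idea is to drop later duplicates of each (TICKET_NUMBER, ITEM) pair with drop_duplicates:

# keep only the first row for each (TICKET_NUMBER, ITEM) pair, preserving
# the original column order and the original row order
df.drop_duplicates(subset=["TICKET_NUMBER", "ITEM"], keep="first").reset_index(drop=True)

That yields the same six rows, just with the columns still in the original TICKET_NUMBER, TICKET_ROW, ITEM order.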