
Python Pandas subset column x values based on unique values in column y

I have a dataframe ("df") equivalent to:

   Cat   Data
    x    0.112
    x    0.112
    y    0.223
    y    0.223
    z    0.112
    z    0.112

In other words, I have a category column and a data column. The data values do not vary within a category, but they may repeat between categories (for example, categories 'x' and 'z' both have the value 0.112). This means I need to select one data point from each category, rather than just subsetting on unique values of "Data".
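To make the problem concrete, here is a minimal sketch that rebuilds the example frame and shows why unique values of "Data" alone are not enough. The names `unique_data` and `n_categories` are illustrative only and do not come from the original post.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Cat": ["x", "x", "y", "y", "z", "z"],
    "Data": [0.112, 0.112, 0.223, 0.223, 0.112, 0.112],
})

# Unique values of "Data" alone: categories x and z share 0.112,
# so they collapse into a single value.
unique_data = df["Data"].unique()

# Number of distinct categories, for comparison.
n_categories = df["Cat"].nunique()

print(len(unique_data))  # 2
print(n_categories)      # 3
```

Two unique values for three categories means one category's data point would be lost.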

The way I've done it is like this:

    aLst = []
    bLst = []
    for i in df.index:
        if df.loc[i,'Cat'] not in aLst:
            aLst += [df.loc[i,'Cat']]
            bLst += [i]

    new_series = pd.Series(df.loc[bLst,'Data'])

Then I can do whatever I want with it. The problem is that this seems like a clunky, un-Pythonic way of doing things. Any suggestions?

Cole Robertson asked Nov 18 '16 15:11




1 Answer

I think you need drop_duplicates:

    # by column Cat
    print(df.drop_duplicates(['Cat']))
      Cat   Data
    0   x  0.112
    2   y  0.223
    4   z  0.112

Or:

    # by columns Cat and Data
    print(df.drop_duplicates(['Cat','Data']))
      Cat   Data
    0   x  0.112
    2   y  0.223
    4   z  0.112
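
As a rough, self-contained sketch of how this could replace the loop in the question: the frame is rebuilt from the example data, and `new_series` mirrors the name used in the question. The `groupby` variant and the name `by_cat` are an assumed alternative, not part of the original answer. It relies on the stated condition that "Data" does not vary within a category.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "Cat": ["x", "x", "y", "y", "z", "z"],
    "Data": [0.112, 0.112, 0.223, 0.223, 0.112, 0.112],
})

# Keep the first row of each category, then select the Data column.
# The original row labels (0, 2, 4) are preserved, as in the loop version.
new_series = df.drop_duplicates("Cat")["Data"]

# Assumed alternative: one value per category, indexed by category label.
by_cat = df.groupby("Cat")["Data"].first()

print(new_series)
print(by_cat)
```

`drop_duplicates` keeps the original index, while the `groupby` version is indexed by "Cat", which may be more convenient for lookups by category.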
jezrael answered Sep 22 '22 00:09