Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select rows from DataFrame where ID count is greater than X

I have a Dataframe which a column that contains an ID. This ID represents a person and can be present many times:

        col_id    col2     col3     col4....
row1      1
row2      1
row3      2
row4      3
row5      3
row6      3
row7      1
row8      7

I need to return a new dataframe where the value_counts of the ID column is greater than, say 2.

New dataframe:

         col_id    col2     col3     col4....
row1      1
row2      1
row3      3
row4      3
row5      3
row6      1

This new dataframe contains rows where the ID count is greater than 2 only.

Edit

From here I need to separate the data by ID. Ideally I would like a solution where I have a dataframe for each ID:

Dataframe 1

    col_id   col2    col3    col4....
r1     1
r2     1
r3     1

Dataframe 2

    col_id   col2    col3    col4....
r1     2
r2     2
r3     2

Dataframe 3

    col_id   col2    col3    col4....
r1     3
r2     3
r3     3

Is it possible to join these into one large dataframe? So I can have a new column, called 'index' that holds the rows for ID==1, ID==2, etc:

index
  1        col_id   col2    col3    col4....
       r1     1
       r2     1
       r3     1



index
  2       col_id   col2    col3    col4....
      r1     2
      r2     2
      r3     2


    index
       3      col_id   col2    col3    col4....
         r1     3
         r2     3
         r3     3
like image 497
Mazz Avatar asked Mar 11 '19 11:03

Mazz


People also ask

How do I count the number of rows with a specific value in Pandas?

We can count by using the value_counts() method. This function is used to count the values present in the entire dataframe and also count values in a particular column.

How do you count how many unique rows a DataFrame has?

You can use the nunique() function to count the number of unique values in a pandas DataFrame.

How do you check for specific values in a DataFrame column a series?

You can check if a column contains/exists a particular value (string/int), list of multiple values in pandas DataFrame by using pd. series() , in operator, pandas. series. isin() , str.


1 Answers

Use GroupBy.transform with GroupBy.size for Series with same size like original DataFrame, so possible filter by boolean indexing:

df = df[df.groupby('col_id')['col_id'].transform('size') > 2]
print (df)
      col_id
row1       1
row2       1
row4       3
row5       3
row6       3
row7       1

If performance is not important or small DataFrame is possible use DataFrameGroupBy.filter:

df = df.groupby('col_id').filter(lambda x: len(x) > 2)

EDIT: For separate DataFrames by col_id is possible create dictionary of DataFrames:

dfs = dict(tuple(df.groupby('col_id')))
print (dfs[1])
      col_id
row1       1
row2       1
row7       1

print (dfs[2])
      col_id
row3       2

print (dfs[3])
      col_id
row4       3
row5       3
row6       3

Possible, but non pythonic solution with globals, not recommended use (only for fun):

for i, df in df.groupby('col_id'):
    globals()['df{}'.format(i)] =  df

print (df1)
      col_id
row1       1
row2       1
row7       1
like image 188
jezrael Avatar answered Oct 18 '22 15:10

jezrael