I have a DataFrame with a column that contains an ID. This ID represents a person and can appear many times:
        col_id    col2     col3     col4....
row1      1
row2      1
row3      2
row4      3
row5      3
row6      3
row7      1
row8      7
I need to return a new DataFrame where the value_counts of the ID column is greater than, say, 2.
New DataFrame:
         col_id    col2     col3     col4....
row1      1
row2      1
row3      3
row4      3
row5      3
row6      1
This new DataFrame contains only the rows whose ID count is greater than 2.
Edit
From here I need to separate the data by ID. Ideally I would like a solution where I have one DataFrame per ID:
Dataframe 1
    col_id   col2    col3    col4....
r1     1
r2     1
r3     1
Dataframe 2
    col_id   col2    col3    col4....
r1     2
r2     2
r3     2
Dataframe 3
    col_id   col2    col3    col4....
r1     3
r2     3
r3     3
Is it possible to join these into one large DataFrame? So I can have a new index level, called 'index', that holds the rows for ID==1, ID==2, etc.:
          col_id   col2    col3    col4....
index
1     r1       1
      r2       1
      r3       1
2     r1       2
      r2       2
      r3       2
3     r1       3
      r2       3
      r3       3
Use GroupBy.transform with 'size' to get a Series the same length as the original DataFrame, so you can filter with boolean indexing:
df = df[df.groupby('col_id')['col_id'].transform('size') > 2]
print (df)
      col_id
row1       1
row2       1
row4       3
row5       3
row6       3
row7       1
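An equivalent approach (a sketch, not part of the original answer) counts IDs with value_counts and keeps the frequent ones via isin:

```python
import pandas as pd

# Sample data matching the question's layout (col2, col3, ... omitted)
df = pd.DataFrame({'col_id': [1, 1, 2, 3, 3, 3, 1, 7]},
                  index=['row1', 'row2', 'row3', 'row4',
                         'row5', 'row6', 'row7', 'row8'])

# Count occurrences per ID, then keep rows whose ID occurs more than twice
counts = df['col_id'].value_counts()
df_filtered = df[df['col_id'].isin(counts[counts > 2].index)]
print(df_filtered)
```

This produces the same rows as the transform-based filter; transform is usually preferred because it avoids building the intermediate counts Series by hand.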
If performance is not important, or the DataFrame is small, you can use DataFrameGroupBy.filter:
df = df.groupby('col_id').filter(lambda x: len(x) > 2)
EDIT: To get separate DataFrames by col_id, you can create a dictionary of DataFrames:
dfs = dict(tuple(df.groupby('col_id')))
print (dfs[1])
      col_id
row1       1
row2       1
row7       1
print (dfs[2])
      col_id
row3       2
print (dfs[3])
      col_id
row4       3
row5       3
row6       3
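For the second part of the question, joining the per-ID frames back into one large DataFrame with an outer 'index' level, a sketch using pd.concat (the names parameter labels the new level; sample data assumed as above):

```python
import pandas as pd

df = pd.DataFrame({'col_id': [1, 1, 2, 3, 3, 3, 1, 7]},
                  index=['row1', 'row2', 'row3', 'row4',
                         'row5', 'row6', 'row7', 'row8'])

# Dictionary of per-ID DataFrames, as in the answer above
dfs = dict(tuple(df.groupby('col_id')))

# Concatenate with the dict keys as a new outer index level named 'index'
big = pd.concat(dfs, names=['index'])
print(big)
```

You can then select all rows for a given ID with `big.loc[1]`, `big.loc[3]`, etc. In practice you could also skip the dictionary and write `pd.concat(dict(tuple(df.groupby('col_id'))), names=['index'])` directly.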
A possible but non-Pythonic solution uses globals; this is not recommended (only for fun):
for i, df_i in df.groupby('col_id'):
    globals()['df{}'.format(i)] = df_i
print (df1)
      col_id
row1       1
row2       1
row7       1