I have a DataFrame with a column that contains an ID. This ID represents a person and can appear many times:
col_id col2 col3 col4....
row1 1
row2 1
row3 2
row4 3
row5 3
row6 3
row7 1
row8 7
I need to return a new DataFrame containing only the rows where the value_counts of the ID column is greater than, say, 2.
New dataframe:
col_id col2 col3 col4....
row1 1
row2 1
row3 3
row4 3
row5 3
row6 1
This new DataFrame contains only the rows whose ID count is greater than 2.
Edit
From here I need to separate the data by ID. Ideally I would like a separate DataFrame for each ID:
Dataframe 1
col_id col2 col3 col4....
r1 1
r2 1
r3 1
Dataframe 2
col_id col2 col3 col4....
r1 2
r2 2
r3 2
Dataframe 3
col_id col2 col3 col4....
r1 3
r2 3
r3 3
Is it possible to join these into one large DataFrame? So I can have a new column called 'index' that holds the rows for ID==1, ID==2, etc.:
index
1 col_id col2 col3 col4....
r1 1
r2 1
r3 1
index
2 col_id col2 col3 col4....
r1 2
r2 2
r3 2
index
3 col_id col2 col3 col4....
r1 3
r2 3
r3 3
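The first part of the question (keep only rows whose ID count exceeds 2) can be sketched directly with value_counts and isin; the DataFrame below is reconstructed from the sample data, with only the col_id column populated:

```python
import pandas as pd

# Reconstruction of the sample data (other columns omitted)
df = pd.DataFrame(
    {'col_id': [1, 1, 2, 3, 3, 3, 1, 7]},
    index=['row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7', 'row8'],
)

# Count occurrences per ID, then keep rows whose ID occurs more than twice
counts = df['col_id'].value_counts()
filtered = df[df['col_id'].isin(counts[counts > 2].index)]
print(filtered)  # rows row1, row2, row4, row5, row6, row7 remain
```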
Use GroupBy.transform with 'size' to get a Series the same length as the original DataFrame, so you can filter with boolean indexing:
df = df[df.groupby('col_id')['col_id'].transform('size') > 2]
print (df)
col_id
row1 1
row2 1
row4 3
row5 3
row6 3
row7 1
If performance is not important, or the DataFrame is small, you can use DataFrameGroupBy.filter:
df = df.groupby('col_id').filter(lambda x: len(x) > 2)
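Both approaches keep the same rows in the original order; a quick check on the reconstructed sample data:

```python
import pandas as pd

df = pd.DataFrame(
    {'col_id': [1, 1, 2, 3, 3, 3, 1, 7]},
    index=['row%d' % i for i in range(1, 9)],
)

# transform('size') broadcasts each group's size back to every row
fast = df[df.groupby('col_id')['col_id'].transform('size') > 2]
# filter calls the lambda once per group and drops whole groups
slow = df.groupby('col_id').filter(lambda x: len(x) > 2)
print(fast.equals(slow))  # True
```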
EDIT: To get separate DataFrames by col_id, you can create a dictionary of DataFrames (the output below assumes grouping the original, unfiltered DataFrame, since ID 2 still appears):
dfs = dict(tuple(df.groupby('col_id')))
print (dfs[1])
col_id
row1 1
row2 1
row7 1
print (dfs[2])
col_id
row3 2
print (dfs[3])
col_id
row4 3
row5 3
row6 3
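For the second part of the question (joining the per-ID DataFrames back into one frame keyed by ID), pd.concat accepts the dictionary directly and builds a MultiIndex whose outer level plays the role of the 'index' column. A sketch, assuming the dfs dictionary built above:

```python
import pandas as pd

df = pd.DataFrame(
    {'col_id': [1, 1, 2, 3, 3, 3, 1, 7]},
    index=['row%d' % i for i in range(1, 9)],
)

dfs = dict(tuple(df.groupby('col_id')))

# Concatenate with the dict keys as an outer index level named 'index'
combined = pd.concat(dfs, names=['index'])
print(combined)
```

Selecting one ID back out is then `combined.loc[1]`, which returns the rows for ID==1 with their original row labels.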
A possible but non-Pythonic solution with globals, not recommended (only for fun):
for i, df in df.groupby('col_id'):
    globals()['df{}'.format(i)] = df
print (df1)
col_id
row1 1
row2 1
row7 1