Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Keeping rows in Pandas where the same ID appears more than n times and convert to list per ID

Tags:

python

pandas

I have a DataFrame like below:

     VID  value
1     1    xx
2     2    xx1
3     2    xx2
4     2    xx3
5     2    xx4
6     3    xx
7     3    xx
8     3    xx
9     4    zz1
10    4    zz2
11    4    zz3
12    4    zz4
13    4    zz5

I want to keep only the rows whose VID exists in more than n=3 cases. In the above example, I'd keep only rows 2-5 and 9-13 (since only the VIDs 2 and 4 appear more than 3 times). The 'value' field is irrelevant.

After keeping the needed rows, I'd like to convert my data as a list of lists while appending an 'end' value at the end of each one:

[[xx1, xx2, xx3, xx4, 'end'], [zz1, zz2, zz3, zz4, zz5, 'end']]

Is it possible to get the above without a for loop?

like image 667
Stergios Avatar asked Oct 20 '15 15:10

Stergios


People also ask

How do you group similar rows in pandas?

You can group DataFrame rows into a list by using pandas. DataFrame. groupby() function on the column of interest, select the column you want as a list from group and then use Series. apply(list) to get the list for every group.

What is the difference between ILOC () and LOC ()?

When it comes to selecting rows and columns of a pandas DataFrame, loc and iloc are two commonly used functions. Here is the subtle difference between the two functions: loc selects rows and columns with specific labels. iloc selects rows and columns at specific integer positions.


1 Answers

You can group by VID column and then take count of each group. Then after that use it for indexing your original df, so as to only take rows that have greater than 3 count. Example -

countdf = df.groupby('VID').count()
result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

Demo -

In [49]: df
Out[49]:
    VID value
1     1    xx
2     2   xx1
3     2   xx2
4     2   xx3
5     2   xx4
6     3    xx
7     3    xx
8     3    xx
9     4   zz1
10    4   zz2
11    4   zz3
12    4   zz4
13    4   zz5

In [51]: df.groupby('VID').count()
Out[51]:
     value
VID
1        1
2        4
3        3
4        5

In [52]: countdf = df.groupby('VID').count()

In [53]: df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]
Out[53]:
    VID value
2     2   xx1
3     2   xx2
4     2   xx3
5     2   xx4
9     4   zz1
10    4   zz2
11    4   zz3
12    4   zz4
13    4   zz5

Then after this, you can again group based on VID and then convert the groups into list and then again to list , Example -

resultlist = result.groupby('VID')['value'].apply(list).tolist()

Demo -

In [54]: result = df.loc[df['VID'].isin(countdf[countdf['value'] > 3].index)]

In [55]: result.groupby('VID')['value'].apply(list).tolist()
Out[55]: [['xx1', 'xx2', 'xx3', 'xx4'], ['zz1', 'zz2', 'zz3', 'zz4', 'zz5']]

Please note, above would not get you the 'end' value in the list, I am guessing that is not necessary, but if you really want that, you can manually add it after getting the list. Example -

resultlist = [elem + ['end'] for elem in resultlist]
like image 174
Anand S Kumar Avatar answered Oct 10 '22 21:10

Anand S Kumar