I am stuck on how to filter out a set of rows based on the indexed position. For more clarity, lets have a dummy problem, lets say I have a datframe of user having multiple profile, like in example df1 i have three user John, Johnny and Ben having their multiple profile..
df1 = pd.DataFrame({"user": ["Peter (1)", "Peter (2)", "Peter (3)","John (1)","John (2)","John (3)","Johnny (1)","Johnny (2)"], "value": [1, 3, 3, 1, 6, 3, 4, 1]}, )
I sort the df1 based on value and reindex it
df1=df1.sort_values(by='value', ascending=False)
df1.index=[0, 1, 2, 3, 4, 5, 6, 7]
df1 looks like this
Now i am stuck as how to filter out rows for user having first indexed value (in this case John), lets say to new dataframe df2, and also to filter out rows for second indexed user (in this case Johnny) to new dataframe df3 expected df2 should look like this
df3 should look like below
After sorting the dataframe you can use str.split
to split the strings in the user
column to create a grouping key
, then group
the dataframe on this grouping key and for each subgroup per user
create a mapping of user
-> dataframe
inside a dict
comprehension:
key = df1['user'].str.split().str[0]
dct = {user:grp.reset_index(drop=True) for user, grp in df1.groupby(key)}
Now to access the dataframe corresponding to the user
we can simply lookup inside the dictionary:
>>> dct['John']
user value
0 John (2) 6
1 John (3) 3
2 John (1) 1
>>> dct['Peter']
user value
0 Peter (2) 3
1 Peter (3) 3
2 Peter (1) 1
>>> dct['Johnny']
user value
0 Johnny (1) 4
1 Johnny (2) 1
You can get the first index value and split it and exclude last item(assuming that user name may have parenthesis), and then search for the value in the entire dataframe for that particular column. For example:
firstIndexUser = df1['user'].str.split('(').str[:-1].str.join('(').iloc[0]
This firstIndexUser will have value as 'John' Now you can compare with against the entire dataframe to get your df2
df2 = df1[df1['user'].str.split('(').str[:-1].str.join('(')==firstIndexUser]
The output looks like this:
>>df2
user value
0 John (2) 6
4 John (3) 3
6 John (1) 1
If you want, you can reset the index for df2
>>df2.reset_index(drop=True, inplace=True)
>>df2
user value
0 John (2) 6
1 John (3) 3
2 John (1) 1
You can follow the similar approach for your df3
df1 = pd.DataFrame({"user": ["Peter (1)", "Peter (2)", "Peter (3)","John (1)","John (2)","John (3)","Johnny (1)","Johnny (2)"], "value": [1, 3, 3, 1, 6, 3, 4, 1]}, )
df1=df1.sort_values(by='value', ascending=False)
cols = df1.columns.tolist()
df1['name'] = df1['user'].replace(r'\s\(\d\)','',regex=True)
grp = df1.groupby(by=['name'])
dataframes = [grp.get_group(x)[cols] for x in grp.groups]
df2, df3 = dataframes[:2] # as mentioned, we are interested just in first two users
df2
:
user value
3 John (1) 1
4 John (2) 6
5 John (3) 3
df3
:
user value
6 Johnny (1) 4
7 Johnny (2) 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With