
I need to create a Python list (or any other object) from a pandas DataFrame by grouping pieces of values from different rows

My DataFrame has a string in the first column, and a number in the second one:

            GEOSTRING  IDactivity
9     wydm2p01uk0fd2z           2
10    wydm86pg6r3jyrg           2
11    wydm2p01uk0fd2z           2
12    wydm80xfxm9j22v           2
39    wydm9w92j538xze           4
40    wydm8km72gbyuvf           4
41    wydm86pg6r3jyrg           4
42    wydm8mzt874p1v5           4
43    wydm8mzmpz5gkt8           5
44    wydm86pg6r3jyrg           5
45    wydm8w1q8bjfpcj           5
46    wydm8w1q8bjfpcj           5
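
For reproducibility, the sample above can be rebuilt with something like the following. This is only a sketch: the variable name `df` is an assumption, and the index labels are copied from the printed table.

```python
import pandas as pd

# Sample data copied from the table above; index labels kept as printed.
df = pd.DataFrame(
    {
        "GEOSTRING": [
            "wydm2p01uk0fd2z", "wydm86pg6r3jyrg", "wydm2p01uk0fd2z", "wydm80xfxm9j22v",
            "wydm9w92j538xze", "wydm8km72gbyuvf", "wydm86pg6r3jyrg", "wydm8mzt874p1v5",
            "wydm8mzmpz5gkt8", "wydm86pg6r3jyrg", "wydm8w1q8bjfpcj", "wydm8w1q8bjfpcj",
        ],
        "IDactivity": [2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5],
    },
    index=[9, 10, 11, 12, 39, 40, 41, 42, 43, 44, 45, 46],
)

print(df)
```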

I want to turn this DataFrame into a list that contains one string per distinct "IDactivity" value, where each string is built from the 5th character of every "GEOSTRING" value in that group. In this case there are 3 distinct "IDactivity" values, so the list will contain 3 strings:

['2828', '9888', '8888']

Again, each character in these strings is the 5th character of one "GEOSTRING" value.

I'm looking for a solution, or an approach, that doesn't involve an overly complicated for loop and is as efficient as possible, since I have to process a lot of data. I'd like it to be clean and fast.

I hope it's clear enough.

zampero asked Jul 08 '17 21:07



2 Answers

This can be done as a one-liner, which should also be reasonably fast:

result = df.groupby('IDactivity')['GEOSTRING'].apply(lambda x:''.join(x.str[4])).tolist()

This groups the DataFrame by the values of IDactivity, then, within each group, selects the 5th character (index 4) of every GEOSTRING string and joins those characters into a single string. Finally, tolist() converts the result from a pandas Series into a plain list.

output:

['2828', '9888', '8888']
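
To make the one-liner easier to follow, here is the same chain split into its intermediate steps. It is a minimal sketch that rebuilds the question's sample data under the assumed name `df`; the names `grouped` and `joined` are only illustrative.

```python
import pandas as pd

# Sample data from the question (the default index is fine for this purpose).
df = pd.DataFrame({
    "GEOSTRING": [
        "wydm2p01uk0fd2z", "wydm86pg6r3jyrg", "wydm2p01uk0fd2z", "wydm80xfxm9j22v",
        "wydm9w92j538xze", "wydm8km72gbyuvf", "wydm86pg6r3jyrg", "wydm8mzt874p1v5",
        "wydm8mzmpz5gkt8", "wydm86pg6r3jyrg", "wydm8w1q8bjfpcj", "wydm8w1q8bjfpcj",
    ],
    "IDactivity": [2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5],
})

# Step 1: group the GEOSTRING column by IDactivity.
grouped = df.groupby("IDactivity")["GEOSTRING"]

# Step 2: within each group, take the character at index 4 of every string
# and join those characters into one string.
joined = grouped.apply(lambda s: "".join(s.str[4]))

# Step 3: drop the IDactivity index and keep only the strings, as a list.
result = joined.tolist()
print(result)  # ['2828', '9888', '8888']
```

The intermediate `joined` is a Series indexed by IDactivity, which is handy if you need to know which string belongs to which ID before converting to a list.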

Documentation:

pandas.DataFrame.groupby
pandas GroupBy.apply

Rayhane Mama answered Oct 06 '22 00:10



Here's a solution involving a temp column, and taking inspiration for the key operation from this answer:

# create a temp column with the character we want from each string
dframe['Temp'] = dframe['GEOSTRING'].apply(lambda x: x[4])

# groupby ID and then concatenate using a sneaky call to .sum()
dframe.groupby('IDactivity')['Temp'].sum().tolist()

Result:

['2828', '9888', '8888']
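
If you would rather not add a temporary column to the DataFrame, a variant along these lines should give the same result. It is a sketch, not part of the original answer: it swaps the `.apply(lambda ...)` and `.sum()` calls for the `.str[4]` accessor and `''.join`, and the names `dframe` and `fifth` are only illustrative.

```python
import pandas as pd

# Sample data from the question.
dframe = pd.DataFrame({
    "GEOSTRING": [
        "wydm2p01uk0fd2z", "wydm86pg6r3jyrg", "wydm2p01uk0fd2z", "wydm80xfxm9j22v",
        "wydm9w92j538xze", "wydm8km72gbyuvf", "wydm86pg6r3jyrg", "wydm8mzt874p1v5",
        "wydm8mzmpz5gkt8", "wydm86pg6r3jyrg", "wydm8w1q8bjfpcj", "wydm8w1q8bjfpcj",
    ],
    "IDactivity": [2, 2, 2, 2, 4, 4, 4, 4, 5, 5, 5, 5],
})

# Extract the 5th character (index 4) of every string as a standalone Series,
# leaving dframe itself unchanged.
fifth = dframe["GEOSTRING"].str[4]

# Group that Series by the IDactivity column (aligned on the index) and
# join the characters of each group into one string.
result = fifth.groupby(dframe["IDactivity"]).agg("".join).tolist()
print(result)  # ['2828', '9888', '8888']
```

Note that groupby sorts the group keys by default, so the strings come out in ascending IDactivity order; pass sort=False to groupby if you need the order of first appearance instead.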
cmaher answered Oct 05 '22 23:10
