
Selecting a random value in a pandas data frame by column

Suppose I have the following pandas data frame:

userID  dayID  feature0  feature1  feature2  feature3
xy1     0      24        15.3      41        43
xy1     1      5         24        34        40
xy1     2      30        7         8         10
gh3     0      50        4         11        12
gh3     1      49        3         59        11
gh3     2      4         9         12        15
...

There are many userIDs, and each one has 3 days and 4 features per day. What I want to do is, for each feature, select 1 of the days at random and then cut down the matrix. So for instance, if feature0 used day 1, feature1 used day 0, feature2 used day 0 and feature3 used day 2:

  userID feature0 feature1 feature2 feature3
    xy1      5       15.3      41     10
    gh3      49       4        11      15
    ... 

And so forth.
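For concreteness, the example above (with the day choices fixed to match the table) can be reproduced with `set_index`/`xs`; this is just an illustration of the desired output, with the random step replaced by a hard-coded day mapping:

```python
import pandas as pd
from io import StringIO

# the sample frame from above
txt = """userID dayID feature0 feature1 feature2 feature3
xy1 0 24 15.3 41 43
xy1 1 5 24 34 40
xy1 2 30 7 8 10
gh3 0 50 4 11 12
gh3 1 49 3 59 11
gh3 2 4 9 12 15"""
df = pd.read_csv(StringIO(txt), sep=r'\s+')

# the day chosen for each feature, hard-coded here to match the example
chosen = {'feature0': 1, 'feature1': 0, 'feature2': 0, 'feature3': 2}

# index by (userID, dayID), then pull each feature's column for its chosen day
wide = df.set_index(['userID', 'dayID'])
reduced = pd.concat(
    [wide.xs(day, level='dayID')[feat] for feat, day in chosen.items()],
    axis=1)
print(reduced)
```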

I've come up with the following, though I'm not sure it does exactly what I want:

reduced_features = features.reset_index().groupby('userID').agg(lambda x: np.random.choice(x,1))

But this seems slow. Is there a faster way to do it?

asked May 04 '18 by user1357015

1 Answer

Since you haven't gotten other suggestions, I will give it a try.

Check the following code sample (explanations are in the code comments):

import pandas as pd
import numpy as np
from io import StringIO

# note: don't name this variable `str`, which would shadow the built-in
txt = """userID  dayID  feature0  feature1  feature2  feature3
xy1      0        24      15.3        41        43
xy1      1         5      24.0        34        40
xy1      2        30       7.0         8        10
gh3      0        50       4.0        11        12
gh3      1        49       3.0        59        11
gh3      2         4       9.0        12        15
"""

df = pd.read_table(StringIO(txt), sep=r'\s+')

def randx(dfg):
    # create a list of row-indices: 0, 1, 2 are all included so that
    # every dayID is covered, and the last one is drawn at random from [0,1,2]
    x = [ 0, 1, 2, np.random.randint(3) ]

    # shuffle the list of row-indices
    np.random.shuffle(x)

    # enumerate list x so that the counter (starting at 2, to skip the first
    # two columns 'userID' and 'dayID') serves as the column-index and the
    # list element as the row-index, then retrieve each element from the group
    return pd.Series([ dfg.iat[j,i] for i,j in enumerate(x,2) ])

    ## alternatively, return the results as a list in a single column:
#    return [ dfg.iat[j,i] for i,j in enumerate(x,2) ]

def feature_name(x):
    return 'feature{}'.format(x)

# if you have many irrelevant columns, then
# retrieve only columns required for calculations
# if you have 1000+ columns(features) and all are required
# skip the following line, you might instead split your dataframe using slicing,  
# i.e. putting 200 features for each calculation, and then merge the results
new_df = df[[ "userID", "dayID", *map(feature_name, [0,1,2,3]) ]]

# do the calculations
d1 = (new_df.groupby('userID')
            .apply(randx)
            # comment out the following .rename() function if you want to 
            # return list instead of Series
            .rename(feature_name, axis=1)
     )

print(d1)
##
        feature0  feature1  feature2  feature3
userID                                        
gh3          4.0       9.0      59.0      12.0
xy1         24.0       7.0      34.0      10.0

More thoughts:

  1. The list of random row-indices that satisfies the requirements can be generated before running apply(randx). For example, if all userIDs have the same number of dayIDs, you can preset these row-indices in a list of lists, or in a dictionary of lists.

    A reminder: if you use a list of lists and L.pop() to dish out the row-indices, make sure the number of lists is at least the number of unique userIDs + 1, since GroupBy.apply() calls its function twice on the first group.

  2. Instead of returning a pd.Series() from randx(), you can directly return a list (see the commented-out return in randx()). In that case, all retrieved features are saved in one column (see below), and you can normalize them later:

    userID
    gh3    [50, 3.0, 59, 15]
    xy1    [30, 7.0, 34, 43]
    
  3. If it is still running slow, you can split the 1000+ columns (features) into groups, i.e. process 200 features per run, slice the predefined row-indices accordingly, and then merge the results.
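A minimal sketch of point 1, assuming every userID has the same dayIDs: pre-generate one shuffled row-index list per user in a dictionary and look it up by group name inside apply(). A plain dict lookup (rather than L.pop()) also sidesteps the double-call caveat, since a lookup consumes nothing:

```python
import numpy as np
import pandas as pd
from io import StringIO

txt = """userID dayID feature0 feature1 feature2 feature3
xy1 0 24 15.3 41 43
xy1 1 5 24 34 40
xy1 2 30 7 8 10
gh3 0 50 4 11 12
gh3 1 49 3 59 11
gh3 2 4 9 12 15"""
df = pd.read_csv(StringIO(txt), sep=r'\s+')

n_days, n_features = 3, 4

def make_indices():
    # all days covered, extra slots drawn at random, then shuffled
    x = [*range(n_days), *np.random.randint(n_days, size=n_features - n_days)]
    np.random.shuffle(x)
    return x

# preset one row-index list per unique userID, before calling apply()
preset = {u: make_indices() for u in df['userID'].unique()}

def randx_preset(dfg):
    x = preset[dfg.name]  # dfg.name is the userID of the current group
    return pd.Series([dfg.iat[j, i] for i, j in enumerate(x, 2)],
                     index=['feature{}'.format(k) for k in range(n_features)])

d1 = df.groupby('userID').apply(randx_preset)
print(d1)
```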

Update: below is a sample test on a VM (Debian 8, 2 GB RAM, 1 CPU):

N_users = 100
N_days = 7
N_features = 1000

users = [ 'user{}'.format(i) for i in range(N_users) ]
days  = [ 'day{}'.format(i) for i in range(N_days)   ]
data =  []
for u in users:
    for d in days:
        data.append([ u, d, *np.random.rand(N_features)])

def feature_name(x):
    return 'feature{}'.format(x)

df = pd.DataFrame(data, columns=['userID', 'dayID', *map(feature_name, range(N_features))])

def randx_to_series(dfg):
    x = [ *range(N_days), *np.random.randint(N_days, size=N_features-N_days) ]
    np.random.shuffle(x)
    return pd.Series([ dfg.iat[j,i] for i,j in enumerate(x,2) ])

def randx_to_list(dfg):
    x = [ *range(N_days), *np.random.randint(N_days, size=N_features-N_days) ]
    np.random.shuffle(x)
    return [ dfg.iat[j,i] for i,j in enumerate(x,2) ]

In [133]: %timeit d1 = df.groupby('userID').apply(randx_to_series)
7.82 s +/- 202 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

In [134]: %timeit d1 = df.groupby('userID').apply(randx_to_list)
7.7 s +/- 47.2 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

In [135]: %timeit d1 = df.groupby('userID').agg(lambda x: np.random.choice(x,1))
8.18 s +/- 31.1 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

# new test: calling np.random.choice() without the lambda is much faster
In [xxx]: %timeit d1 = df.groupby('userID').agg(np.random.choice)
4.63 s +/- 24.7 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

The speed of randx_to_series() is, however, similar to that of your original method using agg() with np.random.choice(), which is theoretically not correct anyway. You might have to define what exactly counts as slow for your use case.
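As a further thought (not part of the benchmark above): if per-user coverage of all days is not actually required, i.e. each user/feature cell only needs a uniformly random day, the whole selection can be done with NumPy fancy indexing and no apply() at all. A sketch, assuming a reasonably recent NumPy (np.take_along_axis) and rows already ordered as one full block of dayIDs per user:

```python
import numpy as np
import pandas as pd

N_users, N_days, N_features = 100, 7, 1000
users = ['user{}'.format(i) for i in range(N_users)]
feats = ['feature{}'.format(i) for i in range(N_features)]

# build a frame ordered user-by-user, with all days per user contiguous
df = pd.DataFrame(np.random.rand(N_users * N_days, N_features), columns=feats)
df.insert(0, 'dayID', np.tile(np.arange(N_days), N_users))
df.insert(0, 'userID', np.repeat(users, N_days))

# view the feature values as a (user, day, feature) cube
vals = df[feats].values.reshape(N_users, N_days, N_features)

# draw one random day per (user, feature) cell and gather along the day axis
days = np.random.randint(N_days, size=(N_users, N_features))
picked = np.take_along_axis(vals, days[:, None, :], axis=1)[:, 0, :]

out = pd.DataFrame(picked, index=users, columns=feats)
```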

More tests on randx_to_series():

with 2000 features, thus total 2002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
15.8 s +/- 225 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

with 5000 features, thus total 5002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:
39.3 s +/- 628 ms per loop (mean +/- std. dev. of 7 runs, 1 loop each)

with 10000 features, thus 10002 columns:
%%timeit
%run ../../../pandas/randomchoice-2-example.py
...:     
1min 21s +/- 1.73 s per loop (mean +/- std. dev. of 7 runs, 1 loop each)

Hope this helps.

Environment: Python 3.6.4, Pandas 0.22.0

answered Oct 12 '22 by jxc