I am processing a dataframe containing a click-stream, and I'm extracting features for each user in the click-stream to be used in a machine learning project.
The dataframe is something like this:
import numpy as np
import pandas as pd

data = pd.DataFrame({'id': ['A01', 'B01', 'A01', 'C01', 'A01', 'B01', 'A01'],
                     'event': ['search', 'search', 'buy', 'home', 'cancel', 'home', 'search'],
                     'date': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-04', '2018-01-06'],
                     'product': ['tablet', 'dvd', 'tablet', 'tablet', 'tablet', 'book', 'book'],
                     'price': [103, 2, 203, 103, 203, 21, 21]})
data['date'] = pd.to_datetime(data['date'])
Since I have to create features for each user I'm using a groupby/apply with a custom function like:
featurized = data.groupby('id').apply(featurize)
The featurize function takes a chunk of the dataframe (all rows of one user) and creates many (hundreds of) features. The whole process is just too slow, so I'm looking for a recommendation on how to do this more efficiently.
An example of the function used to create features:
def featurize(group):
    features = dict()
    # User id
    features['id'] = group['id'].max()
    # Feature 1: Number of search events
    features['number_of_search_events'] = (group['event'] == 'search').sum()
    # Feature 2: Number of tablets
    features['number_of_tablets'] = (group['product'] == 'tablet').sum()
    # Feature 3: Total time
    features['total_time'] = (group['date'].max() - group['date'].min()) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = len(group)
    # Histogram of products examined
    product_counts = group['product'].value_counts()
    # Feature 5: max events for a product
    features['max_product_events'] = product_counts.max()
    # Feature 6: min events for a product
    features['min_product_events'] = product_counts.min()
    # Feature 7: avg events for a product
    features['mean_product_events'] = product_counts.mean()
    # Feature 8: std events for a product
    features['std_product_events'] = product_counts.std()
    # Feature 9: total price for tablet products
    features['tablet_price_sum'] = group.loc[group['product'] == 'tablet', 'price'].sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = group.loc[group['product'] == 'tablet', 'price'].max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = group.loc[group['product'] == 'tablet', 'price'].min()
    # Feature 12: mean price for tablet products
    features['tablet_price_mean'] = group.loc[group['product'] == 'tablet', 'price'].mean()
    # Feature 13: std price for tablet products
    features['tablet_price_std'] = group.loc[group['product'] == 'tablet', 'price'].std()
    return pd.Series(features)
One potential problem is that each feature potentially scans the whole chunk, so with 100 features I scan the chunk 100 times instead of once.
For example, one feature can be the number of "tablet" events the user has, another the number of "home" events, another the average time difference between "search" events, then the average time difference between "search" events for "tablets", and so on. Each feature can be coded as a function that takes a chunk (df) and creates the feature, but when there are hundreds of features each one scans the whole chunk even though a single linear scan would suffice. The problem is that the code gets ugly if I write a manual for loop over each record in the chunk and code all the features inside that loop.
Questions:
If I have to process a dataframe hundreds of times, is there a way to abstract this in a single scan that will create all the needed features?
Is there a speed improvement over the groupby/apply approach I'm currently using?
Pandas indexing should be used to retrieve entire subsets of data, which you then process with vectorized operations over the whole dataframe.
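For instance, the thirteen features above (with the user id as the index) can be expressed as whole-frame groupby aggregations instead of a per-group Python function. The sketch below only illustrates that idea and is not code from the question or the answer: the helper columns is_search/is_tablet and the name featurize_vectorized are made up, and users with no tablet rows end up with NaN price statistics here (the original featurize returns 0 for their sum).

import numpy as np
import pandas as pd

def featurize_vectorized(data):
    # Compute the boolean helpers once, so no individual feature re-scans the frame.
    df = data.assign(is_search=data['event'] == 'search',
                     is_tablet=data['product'] == 'tablet')
    g = df.groupby('id')
    # Features 1-4 via named aggregations.
    features = g.agg(
        number_of_search_events=('is_search', 'sum'),
        number_of_tablets=('is_tablet', 'sum'),
        first_date=('date', 'min'),
        last_date=('date', 'max'),
        events=('event', 'size'),
    )
    features['total_time'] = (features['last_date'] - features['first_date']) / np.timedelta64(1, 'D')
    features = features.drop(columns=['first_date', 'last_date'])
    # Features 5-8: per-product event counts, then their per-user summary statistics.
    counts = df.groupby(['id', 'product']).size()
    count_stats = counts.groupby('id').agg(['max', 'min', 'mean', 'std'])
    count_stats.columns = [c + '_product_events' for c in count_stats.columns]
    # Features 9-13: price statistics over tablet rows only, joined back per user.
    tablet_stats = df.loc[df['is_tablet']].groupby('id')['price'].agg(['sum', 'max', 'min', 'mean', 'std'])
    tablet_stats.columns = ['tablet_price_' + c for c in tablet_stats.columns]
    return features.join(count_stats).join(tablet_stats)

Whether this beats the groupby/apply version on your data is something you would have to measure, but it touches each column only a handful of times instead of once per feature per group.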
Disclaimer: the following answer does not properly answer the above question. Just leaving it here for the sake of work invested. Maybe there will be some use for it at some point.
There are a few starting points:
(1) parallelizing the per-group work,
(2) re-using repeated selections such as group.loc[group['product']=='tablet','price'], and
(3) a more efficient on-disk format such as HDFStore.
As for (1), given your code from above, I could produce speedups of up to 43% (i7-7700HQ CPU, 16GB RAM).
Timings
using joblib: 68.86841534099949s
using multiprocessing: 71.53540843299925s
single-threaded: 119.05010353899888s
Code
import pandas as pd
import numpy as np
import time
import timeit
import os
import joblib
import multiprocessing
def make_data():
    # just some test data ...
    n_users = 100
    events = ['search', 'buy', 'home', 'cancel']
    products = ['tablet', 'dvd', 'book']
    max_price = 1000
    n_duplicates = 1000
    n_rows = 40000
    df = pd.DataFrame({
        'id': list(map(str, np.random.randint(0, n_users, n_rows))),
        'event': list(map(events.__getitem__, np.random.randint(0, len(events), n_rows))),
        'date': list(map(pd.to_datetime, np.random.randint(0, 100000, n_rows))),
        'product': list(map(products.__getitem__, np.random.randint(0, len(products), n_rows))),
        'price': np.random.random(n_rows) * max_price
    })
    df = pd.concat([df for _ in range(n_duplicates)])
    df.to_pickle('big_df.pkl')
    return df
def data():
    return pd.read_pickle('big_df.pkl')
def featurize(group):
    features = dict()
    # Feature 1: Number of search events
    features['number_of_search_events'] = (group['event'] == 'search').sum()
    # Feature 2: Number of tablets
    features['number_of_tablets'] = (group['product'] == 'tablet').sum()
    # Feature 3: Total time
    features['total_time'] = (group['date'].max() - group['date'].min()) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = len(group)
    # Histogram of products examined
    product_counts = group['product'].value_counts()
    # Feature 5: max events for a product
    features['max_product_events'] = product_counts.max()
    # Feature 6: min events for a product
    features['min_product_events'] = product_counts.min()
    # Feature 7: avg events for a product
    features['mean_product_events'] = product_counts.mean()
    # Feature 8: std events for a product
    features['std_product_events'] = product_counts.std()
    # Feature 9: total price for tablet products
    features['tablet_price_sum'] = group.loc[group['product'] == 'tablet', 'price'].sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = group.loc[group['product'] == 'tablet', 'price'].max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = group.loc[group['product'] == 'tablet', 'price'].min()
    # Feature 12: mean price for tablet products
    features['tablet_price_mean'] = group.loc[group['product'] == 'tablet', 'price'].mean()
    # Feature 13: std price for tablet products
    features['tablet_price_std'] = group.loc[group['product'] == 'tablet', 'price'].std()
    return pd.DataFrame.from_records(features, index=[group['id'].max()])
# https://stackoverflow.com/questions/26187759/parallelize-apply-after-pandas-groupby
def apply_parallel_job(dfGrouped, func):
    retLst = joblib.Parallel(n_jobs=multiprocessing.cpu_count())(
        joblib.delayed(func)(group) for name, group in dfGrouped)
    return pd.concat(retLst)

def apply_parallel_pool(dfGrouped, func):
    with multiprocessing.Pool(multiprocessing.cpu_count()) as p:
        ret_list = list(p.map(func, [group for name, group in dfGrouped]))
    return pd.concat(ret_list)
featurized_job = lambda df: apply_parallel_job(df.groupby('id'), featurize)
featurized_pol = lambda df: apply_parallel_pool(df.groupby('id'), featurize)
featurized_sng = lambda df: df.groupby('id').apply(featurize)
make_data()
print(timeit.timeit("featurized_job(data())", "from __main__ import featurized_job, data", number=3))
print(timeit.timeit("featurized_sng(data())", "from __main__ import featurized_sng, data", number=3))
print(timeit.timeit("featurized_pol(data())", "from __main__ import featurized_pol, data", number=3))
As for (2), consider the following refactoring:
Timings
original: 112.0091859719978s
re-used prices: 83.85681765000118s
Code
    # [...]
    prices_ = group.loc[group['product'] == 'tablet', 'price']
    # Feature 9: total price for tablet products
    features['tablet_price_sum'] = prices_.sum()
    # Feature 10: max price for tablet products
    features['tablet_price_max'] = prices_.max()
    # Feature 11: min price for tablet products
    features['tablet_price_min'] = prices_.min()
    # Feature 12: mean price for tablet products
    features['tablet_price_mean'] = prices_.mean()
    # Feature 13: std price for tablet products
    features['tablet_price_std'] = prices_.std()
    # [...]
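Point (3), the storage format, is not demonstrated above. Purely as a hypothetical illustration, swapping the pickle round-trip in make_data()/data() for an HDF5 file could look like the following (the names make_data_hdf/data_hdf are made up, and pandas needs the PyTables package installed for to_hdf/read_hdf):

def make_data_hdf():
    # Same synthetic data as make_data(), persisted to an HDF5 file instead of a pickle.
    df = make_data()
    df.to_hdf('big_df.h5', key='df', mode='w')
    return df

def data_hdf():
    return pd.read_hdf('big_df.h5', 'df')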