
Using apply function works, however using .assign() the same function does not?

I am a bit stuck. I have a working function that I can use with .apply(), but I cannot get it to work with .assign(). I'd like it to work with .assign() so that I can chain a number of transformations together.

Could anyone point me in the right direction to resolve the issue?

This works

import re
import pandas as pd

data = {'heading': ['some men', 'some men', 'some women']}

dataframe = pd.DataFrame(data=data)

def add_gender(x):
    if re.search("(womens?)", x.heading, re.IGNORECASE):
        return 'women'
    elif re.search("(mens?)", x.heading, re.IGNORECASE):
        return 'men'
    else:
        return 'unisex'

dataframe['g'] = dataframe.apply(lambda ref: add_gender(ref), axis=1)

This does not work

dataframe = dataframe.assign(gender = lambda ref: add_gender(ref))

TypeError: expected string or bytes-like object

Is this because .assign() does not provide an axis argument? So perhaps the function is not looking for the right thing?

The documentation says that .assign() can generate a new column, so I assumed the output would be the same as with .apply(axis=1).

Asked by dimButTries on Dec 12 '21 at 13:12


1 Answer

From the documentation of DataFrame.assign:

DataFrame.assign(**kwargs)

(...)

Parameters

**kwargs : dict of {str: callable or Series}

The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

This means that in

dataframe = dataframe.assign(gender=lambda ref: add_gender(ref))

ref stands for the calling DataFrame, i.e. dataframe, and thus you are passing the whole dataframe to the function add_gender. However, according to how it's defined, add_gender expects a single row (Series object) to be passed as the argument x, not the whole DataFrame.

if re.search("(womens?)", x.heading, re.IGNORECASE):

In the case of assign, x.heading stands for the whole column heading of dataframe (x), which is a Series object. However, re.search only works with string or bytes-like objects, so the error is raised. In the case of apply, by contrast, x.heading is the heading field of each individual row x of dataframe, which is a string value.
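The difference described above can be checked directly. The following is a minimal sketch, not part of the original answer: the helper names probe_assign and probe_apply and the column name probe are made up for illustration. It records what x.heading is in each case and reproduces the error from re.search.

```python
import re
import pandas as pd

dataframe = pd.DataFrame({'heading': ['some men', 'some men', 'some women']})

calls_assign = []  # type names seen when the callable is run by assign
calls_apply = []   # type names seen when the callable is run by apply(axis=1)

def probe_assign(df):
    # assign passes the whole DataFrame, so df.heading is a full column
    calls_assign.append(type(df.heading).__name__)
    return df.heading

def probe_apply(row):
    # apply(axis=1) passes one row at a time, so row.heading is a single value
    calls_apply.append(type(row.heading).__name__)
    return row.heading

dataframe.assign(probe=probe_assign)
dataframe.apply(probe_apply, axis=1)

print(calls_assign)  # the column arrives as a Series
print(calls_apply)   # each row's value arrives as a str

# Passing a whole column to re.search reproduces the error from the question
error_message = None
try:
    re.search("(mens?)", dataframe.heading, re.IGNORECASE)
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```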

To solve this, use assign together with apply. Note that the lambda in lambda ref: add_gender(ref) is redundant; it is equivalent to passing add_gender directly.

dataframe = dataframe.assign(gender=lambda df: df.apply(add_gender, axis=1))
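Since the goal in the question was to chain transformations, here is a small sketch of how the row-wise add_gender from the question might sit inside a chain. The second step and its column name is_unisex are hypothetical, added only to show that a later assign can use the column created by an earlier one.

```python
import re
import pandas as pd

def add_gender(x):
    # Row-wise version from the question: x is a single row (a Series)
    if re.search("(womens?)", x.heading, re.IGNORECASE):
        return 'women'
    elif re.search("(mens?)", x.heading, re.IGNORECASE):
        return 'men'
    else:
        return 'unisex'

dataframe = pd.DataFrame({'heading': ['some men', 'some men', 'some women']})

result = (
    dataframe
    # apply(axis=1) feeds add_gender one row at a time
    .assign(gender=lambda df: df.apply(add_gender, axis=1))
    # hypothetical follow-up step that relies on the new 'gender' column
    .assign(is_unisex=lambda df: df['gender'].eq('unisex'))
)
print(result)
```

The original dataframe is left unchanged, because assign returns a new object.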

As a suggestion, here is a more concise way of defining add_gender, using Series.str.extract and Series.fillna.

def add_gender(df):
    pat = r'\b(men|women)s?\b'
    return df['heading'].str.extract(pat, flags=re.IGNORECASE).fillna('unisex')

Regarding the regex pattern '\b(men|women)s?\b':

  • \b matches a word boundary
  • (men|women) matches men or women literally and captures the group
  • s? matches s zero or one times

Series.str.extract extracts the capture group from each string value of the column heading. Non-matches are set to NaN. Then, Series.fillna replaces the NaNs with 'unisex'.

In this case, add_gender expects the whole DataFrame to be passed. With this definition, you can simply do

dataframe = dataframe.assign(gender=add_gender)
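The extraction step can also be tried in isolation. This sketch uses a shortened list of sample headings and is meant as an illustration only. One detail worth knowing: by default (expand=True), Series.str.extract returns a one-column DataFrame even for a single capture group, while expand=False returns a Series. The variant below passes expand=False, which is an optional adjustment and not part of the answer's definition.

```python
import re
import pandas as pd

headings = pd.Series(['some men', 'y womens', 'other', 'blahmenblah'], name='heading')
pat = r'\b(men|women)s?\b'

# Default behaviour: a DataFrame with one column per capture group
as_frame = headings.str.extract(pat, flags=re.IGNORECASE)

# With expand=False and a single capture group, the result is a Series
as_series = headings.str.extract(pat, flags=re.IGNORECASE, expand=False)

# Non-matches are NaN until fillna replaces them
gender = as_series.fillna('unisex')
print(gender.tolist())
```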

Setup:

import pandas as pd
import re

data = {'heading': ['some men', 'some men', 'some women', 'x mens', 'y womens',  'other', 'blahmenblah', 'blahwomenblah']}
dataframe = pd.DataFrame(data=data)

def add_gender(df):
    pat = r'\b(men|women)s?\b'
    return df['heading'].str.extract(pat, flags=re.IGNORECASE).fillna('unisex')

Output:

>>> dataframe 

         heading
0       some men
1       some men
2     some women
3         x mens
4       y womens
5          other
6    blahmenblah
7  blahwomenblah

>>> dataframe = dataframe.assign(gender = add_gender)
>>> dataframe 

         heading  gender
0       some men     men
1       some men     men
2     some women   women
3         x mens     men
4       y womens   women
5          other  unisex
6    blahmenblah  unisex
7  blahwomenblah  unisex
Answered by Rodalm on Oct 13 '22 at 15:10