I am a bit stuck. I have a working function that can be used with .apply(); however, I cannot seem to get it to work with .assign(). I'd like this to work with assign so that I can chain a number of transformations together.
Could anyone point me in the right direction to resolve the issue?
This works
import re
import pandas as pd

data = {'heading': ['some men', 'some men', 'some women']}
dataframe = pd.DataFrame(data=data)

def add_gender(x):
    # x is a single row (a Series) when called through apply(axis=1)
    if re.search("(womens?)", x.heading, re.IGNORECASE):
        return 'women'
    elif re.search("(mens?)", x.heading, re.IGNORECASE):
        return 'men'
    else:
        return 'unisex'

dataframe['g'] = dataframe.apply(lambda ref: add_gender(ref), axis=1)
This does not work
dataframe = dataframe.assign(gender = lambda ref: add_gender(ref))
TypeError: expected string or bytes-like object
Is this because .assign() does not provide an axis argument? So perhaps the function is not looking for the right thing?
Having read the documentation, .assign states you can generate a new column, so I assumed the output would be the same as .apply(axis=1).
From the documentation of DataFrame.assign:
DataFrame.assign(**kwargs)
(...)
Parameters **kwargs : dict of {str: callable or Series}
The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.
This means that in
dataframe = dataframe.assign(gender=lambda ref: add_gender(ref))
ref stands for the calling DataFrame, i.e. dataframe, and thus you are passing the whole dataframe to the function add_gender. However, according to how it's defined, add_gender expects a single row (Series object) to be passed as the argument x, not the whole DataFrame.
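To see the difference directly, here is a minimal sketch that records the type of the argument each method hands to the callable. The helpers record_assign and record_apply and the two lists are hypothetical names introduced only for this illustration.

```python
import pandas as pd

dataframe = pd.DataFrame({'heading': ['some men', 'some men', 'some women']})

seen_by_assign = []
seen_by_apply = []

def record_assign(df):
    # Hypothetical helper: note what assign passes in, then return a constant
    seen_by_assign.append(type(df).__name__)
    return 'placeholder'

def record_apply(row):
    # Hypothetical helper: note what apply(axis=1) passes in for each row
    seen_by_apply.append(type(row).__name__)
    return 'placeholder'

dataframe.assign(gender=record_assign)
dataframe.apply(record_apply, axis=1)

print(seen_by_assign)      # ['DataFrame']
print(set(seen_by_apply))  # {'Series'}
```

assign calls the function once with the whole DataFrame, while apply(axis=1) calls it with one row Series at a time.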
if re.search("(womens?)", x.heading, re.IGNORECASE):
In the case of assign, x.heading stands for the whole column heading of dataframe (x), which is a Series object. However, re.search only works with string or bytes-like objects, so the error is raised. In the case of apply, by contrast, x.heading corresponds to the field heading of each individual row x of dataframe, which is a string value.
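The error can be reproduced outside of assign. The sketch below calls re.search once on the whole column and once on a single value; the variable names column, first_value and series_error are only for illustration.

```python
import re
import pandas as pd

dataframe = pd.DataFrame({'heading': ['some men', 'some men', 'some women']})

column = dataframe.heading               # what x.heading is inside assign
first_value = dataframe.heading.iloc[0]  # what x.heading is for the first row in apply(axis=1)

try:
    re.search("(womens?)", column, re.IGNORECASE)
    series_error = None
except TypeError as exc:
    # re.search refuses anything that is not a str or bytes-like object
    series_error = exc

match = re.search("(mens?)", first_value, re.IGNORECASE)

print(type(series_error).__name__)  # TypeError
print(match.group(1))               # men
```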
To solve this, just use assign with apply:
dataframe = dataframe.assign(gender=lambda df: df.apply(add_gender, axis=1))
Note that the lambda in lambda ref: add_gender(ref) is redundant; it is equivalent to just passing add_gender.
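Since the goal in the question is chaining, here is a self-contained sketch of how this could look with the row-wise add_gender from the question. The second step (heading_upper) is a made-up transformation, included only to show a chain of more than one assign.

```python
import re
import pandas as pd

def add_gender(x):
    # Row-wise version from the question: x is a single row (a Series)
    if re.search("(womens?)", x.heading, re.IGNORECASE):
        return 'women'
    elif re.search("(mens?)", x.heading, re.IGNORECASE):
        return 'men'
    else:
        return 'unisex'

dataframe = pd.DataFrame({'heading': ['some men', 'some men', 'some women']})

result = (
    dataframe
    # assign hands over the whole frame; apply(axis=1) then feeds it row by row
    .assign(gender=lambda df: df.apply(add_gender, axis=1))
    # hypothetical second step, only to illustrate chaining
    .assign(heading_upper=lambda df: df['heading'].str.upper())
)

print(result['gender'].tolist())  # ['men', 'men', 'women']
```

Because assign returns a new object, the original dataframe is left without the new columns.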
As a suggestion, here is a more concise way of defining add_gender, using Series.str.extract and Series.fillna.
def add_gender(df):
    pat = r'\b(men|women)s?\b'
    return df['heading'].str.extract(pat, flags=re.IGNORECASE).fillna('unisex')
Regarding the regex pattern '\b(men|women)s?\b':
- \b matches a word boundary
- (men|women) matches men or women literally and captures the group
- s? matches s zero or one times
Series.str.extract extracts the capture group of each string value of the column heading. Non-matches are set to NaN. Then, Series.fillna replaces the NaNs with 'unisex'.
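The pattern can be checked on plain strings with the re module before using it on a column. This is a small sketch; the samples list and the extracted dict are illustrative names, and None here plays the role that NaN plays in the pandas version.

```python
import re

pat = re.compile(r'\b(men|women)s?\b', flags=re.IGNORECASE)

samples = ['some men', 'x mens', 'y womens', 'other', 'blahmenblah']

extracted = {}
for text in samples:
    m = pat.search(text)
    # group(1) is the captured word without the optional trailing "s"
    extracted[text] = m.group(1) if m else None

print(extracted)
# {'some men': 'men', 'x mens': 'men', 'y womens': 'women', 'other': None, 'blahmenblah': None}
```

'blahmenblah' does not match because there is no word boundary on either side of the embedded "men".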
In this case, add_gender expects the whole DataFrame to be passed. With this definition, you can simply do
dataframe = dataframe.assign(gender=add_gender)
Setup:
import pandas as pd
import re
data = {'heading': ['some men', 'some men', 'some women', 'x mens', 'y womens', 'other', 'blahmenblah', 'blahwomenblah']}
dataframe = pd.DataFrame(data=data)
def add_gender(df):
    pat = r'\b(men|women)s?\b'
    return df['heading'].str.extract(pat, flags=re.IGNORECASE).fillna('unisex')
Output:
>>> dataframe
         heading
0       some men
1       some men
2     some women
3         x mens
4       y womens
5          other
6    blahmenblah
7  blahwomenblah
>>> dataframe = dataframe.assign(gender = add_gender)
>>> dataframe
         heading  gender
0       some men     men
1       some men     men
2     some women   women
3         x mens     men
4       y womens   women
5          other  unisex
6    blahmenblah  unisex
7  blahwomenblah  unisex
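As an optional variation, not part of the answer above: with a single capture group, passing expand=False makes Series.str.extract return a Series rather than a one-column DataFrame, and since the match is case-insensitive you may also want to lower-case the captured text. The function name add_gender_series and the mixed-case sample data are assumptions made for this sketch.

```python
import re
import pandas as pd

def add_gender_series(df):
    # Hypothetical variant of add_gender that returns a Series
    pat = r'\b(men|women)s?\b'
    return (
        df['heading']
        .str.extract(pat, flags=re.IGNORECASE, expand=False)  # Series, NaN where no match
        .str.lower()                                          # 'MEN' -> 'men', NaN stays NaN
        .fillna('unisex')
    )

dataframe = pd.DataFrame({'heading': ['Some MEN', 'y womens', 'other']})
dataframe = dataframe.assign(gender=add_gender_series)

print(dataframe['gender'].tolist())  # ['men', 'women', 'unisex']
```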