I have the following pandas DataFrame:
import pandas as pd
from pandas import Series, DataFrame
data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})
I'd like to change the values in columns Qu1, Qu2, and Qu3 according to value_counts(), depending on whether a value's count is greater than or equal to some number.
For example, for the Qu1 column:
>>> pd.value_counts(data.Qu1) >= 2
cheese True
potato True
banana True
apple False
egg False
I'd like to keep the values cheese, potato, and banana, because each appears at least twice.
From the values apple and egg, I'd like to create the value other.
For column Qu2, no changes are needed:
>>> pd.value_counts(data.Qu2) >= 2
banana True
apple True
sausage True
The final result should look like the attached test_data:
test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})
Thanks!
Series.value_counts() returns a Series containing the counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring value.
DataFrame.replace() replaces values in a DataFrame. Given a plain old-to-new mapping, it applies to all columns; given a nested dictionary keyed by column name, it applies each mapping only to that column.
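To make these two building blocks concrete, here is a minimal sketch on a single column (the Qu1 values from the question). The variable names `counts`, `rare`, and `result` are mine, chosen for illustration only.

```python
import pandas as pd

# The Qu1 column from the question, as a standalone Series.
s = pd.Series(['apple', 'potato', 'cheese', 'banana', 'cheese',
               'banana', 'cheese', 'potato', 'egg'])

# value_counts() returns counts sorted in descending order.
counts = s.value_counts()

# Labels that appear fewer than 2 times.
rare = counts[counts < 2].index

# Series.replace accepts a dict mapping old values to new ones.
result = s.replace({label: 'other' for label in rare})
print(result.tolist())
```

The same idea extends to a whole DataFrame, which is what the answers here do in two different ways.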
I would create a DataFrame of the same shape, where each entry is the value count of the corresponding value:
data.apply(lambda x: x.map(x.value_counts()))
Out[229]:
Qu1 Qu2 Qu3
0 1 2 1
1 2 4 3
2 3 3 1
3 2 3 3
4 3 3 3
5 2 2 3
6 3 4 3
7 2 4 3
8 1 4 1
Then use the result in df.where to return "other" wherever the corresponding count is smaller than 2:
data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")
Qu1 Qu2 Qu3
0 other sausage other
1 potato banana potato
2 cheese apple other
3 banana apple cheese
4 cheese apple cheese
5 banana sausage potato
6 cheese banana cheese
7 potato banana potato
8 other banana other
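If the threshold needs to vary, the .where approach can be wrapped in a small helper. This is only a sketch: the function name `lump_rare_values` and its parameters `min_count` and `other` are hypothetical names I made up, not part of pandas.

```python
import pandas as pd


def lump_rare_values(frame, min_count=2, other="other"):
    """Replace values that occur fewer than `min_count` times in their column."""
    # Same shape as `frame`: each cell holds the count of its value within its column.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    # Keep cells whose count meets the threshold; substitute `other` elsewhere.
    return frame.where(counts >= min_count, other)


data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

result = lump_rare_values(data)
print(result)
```

With the default `min_count=2` this should reproduce the test_data frame from the question; a larger threshold lumps more values into "other".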
You could also count the values per column first (df here is the question's data):
value_counts = df.apply(lambda x: x.value_counts())
Qu1 Qu2 Qu3
apple 1.0 3.0 1.0
banana 2.0 4.0 NaN
cheese 3.0 NaN 3.0
egg 1.0 NaN 1.0
potato 2.0 NaN 3.0
sausage NaN 2.0 1.0
Then build a dictionary that will contain the replacements for each column:
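from itertools import cycle
replacements = {}
for col, s in value_counts.items():
    # only add an entry for columns that actually contain rare values
    if s[s < 2].any():
        # map every label with a count below 2 to 'other'
        replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))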
replacements
{'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}
Use the dictionary to replace the values:
df.replace(replacements)
Qu1 Qu2 Qu3
0 other sausage other
1 potato banana potato
2 cheese apple other
3 banana apple cheese
4 cheese apple cheese
5 banana sausage potato
6 cheese banana cheese
7 potato banana potato
8 other banana other
Or wrap the loop in a dictionary comprehension:
from itertools import cycle
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
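As a side note, the same per-column mapping can probably be built without itertools.cycle by using the built-in dict.fromkeys. This is a self-contained sketch of that variant, not a change to the approach above; `df` is rebuilt here from the question's data so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

# One column of counts per original column; labels missing from a column are NaN.
value_counts = df.apply(lambda col: col.value_counts())

# dict.fromkeys maps every rare label to the same replacement value.
# NaN < 2 evaluates to False, so labels absent from a column are skipped.
replacements = {
    col: dict.fromkeys(s[s < 2].index, 'other')
    for col, s in value_counts.items()
    if (s < 2).any()
}

result = df.replace(replacements)
print(replacements)
```

Columns without rare values (Qu2 here) get no entry in `replacements`, so they are left untouched.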
However, this is not only more cumbersome but also slower than using .where. Testing with 3,000 columns:
df = pd.concat([df for i in range(1000)], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 3000 entries, Qu1 to Qu3
dtypes: object(3000)
Using .replace():
%%timeit
value_counts = df.apply(lambda x: x.value_counts())
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
1 loop, best of 3: 4.97 s per loop
vs. .where():
%%timeit
df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other")
1 loop, best of 3: 2.01 s per loop
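The %%timeit cells above need IPython. For anyone who wants to repeat the comparison in a plain script, here is a rough sketch using the standard timeit module. Some assumptions on my side: the frame is much narrower (150 columns rather than 3,000) to keep it quick, the columns are renamed so every label is unique (unlike the benchmark above), and the helper names `with_replace` and `with_where` are made up. Timings depend on the machine and pandas version, so no numbers are claimed here.

```python
import timeit
from itertools import cycle

import pandas as pd

data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

# Widen the frame by repeating the columns, then give each column a unique label.
wide = pd.concat([data] * 50, axis=1)
wide.columns = [f"Q{i}" for i in range(wide.shape[1])]


def with_replace(frame):
    # Per-column replacement dictionaries, as in the .replace() answer.
    value_counts = frame.apply(lambda col: col.value_counts())
    return frame.replace({
        col: dict(zip(s[s < 2].index.tolist(), cycle(['other'])))
        for col, s in value_counts.items() if s[s < 2].any()
    })


def with_where(frame):
    # Count-per-cell mask, as in the .where() answer.
    return frame.where(frame.apply(lambda col: col.map(col.value_counts())) >= 2, 'other')


t_replace = timeit.timeit(lambda: with_replace(wide), number=3)
t_where = timeit.timeit(lambda: with_where(wide), number=3)
print(f".replace(): {t_replace:.3f} s, .where(): {t_where:.3f} s (3 runs each)")
```

Both helpers should return the same frame, so any difference in the printed times reflects only the method used.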