I have the following pandas DataFrame:
import pandas as pd
from pandas import Series, DataFrame
data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})
I'd like to change the values in columns Qu1, Qu2 and Qu3 according to value_counts(): a value should be kept only when its count is greater than or equal to some number.
For example, for the Qu1 column:
>>> data.Qu1.value_counts() >= 2
cheese True
potato True
banana True
apple False
egg False
I'd like to keep the values cheese, potato and banana, because each of them appears at least twice.
The values apple and egg should be replaced with the value other.
For column Qu2 there are no changes:
>>> data.Qu2.value_counts() >= 2
banana True
apple True
sausage True
The final result should look like the attached test_data:
test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})
Thanks !
Series.value_counts() returns a Series containing the counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring value.
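As a small illustration of that ordering, here is a sketch using the Qu1 column from the question. The names `qu1`, `counts` and `keep` are only illustrative, and the threshold of 2 is the one used in the question.

```python
import pandas as pd

qu1 = pd.Series(['apple', 'potato', 'cheese', 'banana', 'cheese',
                 'banana', 'cheese', 'potato', 'egg'], name='Qu1')

# Counts are sorted from most to least frequent.
counts = qu1.value_counts()

# Values that appear at least twice (the ones the question wants to keep).
keep = counts[counts >= 2].index.tolist()
```

The order among values with equal counts is not relied on here, which is why `keep` is best treated as a set.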
DataFrame.replace() replaces values in a DataFrame; by default, one value is replaced with another in all columns.
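To restrict a replacement to a single column, a nested dictionary of the form {column: {old: new}} can be passed instead. A minimal sketch on a made-up three-row frame (the data here is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Qu1': ['apple', 'potato', 'egg'],
                   'Qu2': ['apple', 'banana', 'banana']})

# The outer key selects the column; the inner dict maps old values to new ones.
# 'apple' in Qu2 is left alone because only Qu1 is listed.
result = df.replace({'Qu1': {'apple': 'other', 'egg': 'other'}})
```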
I would create a DataFrame of the same shape in which each entry is the count of that value within its column:
data.apply(lambda x: x.map(x.value_counts()))
Out[229]:
Qu1 Qu2 Qu3
0 1 2 1
1 2 4 3
2 3 3 1
3 2 3 3
4 3 3 3
5 2 2 3
6 3 4 3
7 2 4 3
8 1 4 1
Then use the result in DataFrame.where to return "other" wherever the corresponding count is smaller than 2:
data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")
Qu1 Qu2 Qu3
0 other sausage other
1 potato banana potato
2 cheese apple other
3 banana apple cheese
4 cheese apple cheese
5 banana sausage potato
6 cheese banana cheese
7 potato banana potato
8 other banana other
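The two steps above can be wrapped in a small helper so that the threshold and the fill value are parameters. This is only a sketch: `replace_rare`, `min_count` and `fill` are hypothetical names, not part of pandas.

```python
import pandas as pd


def replace_rare(frame, min_count=2, fill='other'):
    """Hypothetical helper: replace values seen fewer than min_count times in their column."""
    # Same shape as frame; each cell holds the count of its value within its column.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    # Keep cells whose count reaches the threshold, fill the rest.
    return frame.where(counts >= min_count, fill)


data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

result = replace_rare(data)
```

With the default arguments this should reproduce the test_data frame from the question.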
You could also compute the value counts for every column first (df here is the DataFrame from the question):
value_counts = df.apply(lambda x: x.value_counts())
Qu1 Qu2 Qu3
apple 1.0 3.0 1.0
banana 2.0 4.0 NaN
cheese 3.0 NaN 3.0
egg 1.0 NaN 1.0
potato 2.0 NaN 3.0
sausage NaN 2.0 1.0
Then build a dictionary that will contain the replacements for each column:
from itertools import cycle

replacements = {}
for col, s in value_counts.items():
    # s[s < 2] holds the values of this column that occur fewer than twice
    if s[s < 2].any():
        replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))
replacements
{'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}
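The same mapping can probably be built without cycle, for instance with dict.fromkeys. The sketch below is an alternative spelling rather than the answer's original code; it rebuilds df and value_counts so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})
value_counts = df.apply(lambda x: x.value_counts())

# For each column, map every value with a count below 2 to 'other'.
# NaN counts (value absent from that column) compare as False and are skipped.
replacements = {
    col: dict.fromkeys(s[s < 2].index, 'other')
    for col, s in value_counts.items()
    if (s < 2).any()
}
```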
Use the dictionary to replace the values:
df.replace(replacements)
Qu1 Qu2 Qu3
0 other sausage other
1 potato banana potato
2 cheese apple other
3 banana apple cheese
4 cheese apple cheese
5 banana sausage potato
6 cheese banana cheese
7 potato banana potato
8 other banana other
Or wrap the loop in a dictionary comprehension:
from itertools import cycle
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
However, this is not only more cumbersome but also slower than using .where. Testing with 3,000 columns:
df = pd.concat([df for i in range(1000)], axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 3000 entries, Qu1 to Qu3
dtypes: object(3000)
Using .replace():
%%timeit
value_counts = df.apply(lambda x: x.value_counts())
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
1 loop, best of 3: 4.97 s per loop
vs .where():
%%timeit
df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other")
1 loop, best of 3: 2.01 s per loop
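Outside IPython, a comparison along these lines can be sketched with the standard timeit module. The function names `with_where` and `with_replace` are hypothetical, the frame is the original three-column one rather than the 3,000-column version, and no timings are quoted because they depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})


def with_where(frame):
    # Mask cells whose value occurs fewer than twice in its column.
    return frame.where(frame.apply(lambda x: x.map(x.value_counts())) >= 2, 'other')


def with_replace(frame):
    # Build a per-column replacement dict, then apply it.
    value_counts = frame.apply(lambda x: x.value_counts())
    return frame.replace({col: dict.fromkeys(s[s < 2].index, 'other')
                          for col, s in value_counts.items() if (s < 2).any()})


# Check that both approaches agree before comparing their speed.
same = with_where(df).equals(with_replace(df))

t_where = timeit.timeit(lambda: with_where(df), number=20)
t_replace = timeit.timeit(lambda: with_replace(df), number=20)
```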