Change values in pandas dataframe according to value_counts()

Tags: python, pandas, python-2.7

I have the following pandas DataFrame:

import pandas as pd 
from pandas import Series, DataFrame

data = DataFrame({'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
              'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
              'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg']})

I'd like to change the values in columns Qu1, Qu2, and Qu3 according to value_counts(): a value is kept only when its count is greater than or equal to some number.

For example, for column Qu1:

>>> pd.value_counts(data.Qu1) >= 2
cheese     True
potato     True
banana     True
apple     False
egg       False

I'd like to keep the values cheese, potato, and banana, because each appears at least twice.

I'd like to replace the values apple and egg with a single value, other.

For column Qu2, nothing changes:

>>> pd.value_counts(data.Qu2) >= 2
banana     True
apple      True
sausage    True

The final result should match test_data:

test_data = DataFrame({'Qu1': ['other', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'other'],
                  'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
                  'Qu3': ['other', 'potato', 'other', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'other']})

Thanks!

asked May 15 '16 14:05

Toren

2 Answers

I would create a DataFrame of the same shape in which each entry is the count of the corresponding value within its column:

data.apply(lambda x: x.map(x.value_counts()))
Out[229]: 
   Qu1  Qu2  Qu3
0    1    2    1
1    2    4    3
2    3    3    1
3    2    3    3
4    3    3    3
5    2    2    3
6    3    4    3
7    2    4    3
8    1    4    1
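To see what the lambda does, it may help to look at a single column. The following is a minimal sketch using only the question's Qu1 values; the variable names `qu1`, `counts`, and `per_row` are illustrative and not from the original answer.

```python
import pandas as pd

# The Qu1 column from the question.
qu1 = pd.Series(['apple', 'potato', 'cheese', 'banana', 'cheese',
                 'banana', 'cheese', 'potato', 'egg'], name='Qu1')

# value_counts() returns a Series indexed by the distinct values.
counts = qu1.value_counts()

# map() looks up each row's value in that index, giving one count per row.
per_row = qu1.map(counts)

print(per_row.tolist())  # [1, 2, 3, 2, 3, 2, 3, 2, 1]
```

Applying this to every column with `apply` produces the table above.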

Then use the result as the condition in DataFrame.where, which keeps each entry whose count is at least 2 and returns "other" where the count is smaller than 2:

data.where(data.apply(lambda x: x.map(x.value_counts()))>=2, "other")

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other
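The same idea can be wrapped in a small helper so that the threshold and the replacement label are parameters. This is only a sketch: the function name `lump_rare_values` and its parameters `min_count` and `other` are hypothetical and not part of pandas or of the original answer.

```python
import pandas as pd


def lump_rare_values(frame, min_count=2, other="other"):
    """Replace values that occur fewer than min_count times in their column."""
    # Same-shape frame holding, for each entry, the count of its value
    # within its own column.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    # Keep entries whose value is frequent enough; substitute `other` elsewhere.
    return frame.where(counts >= min_count, other)


data = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

result = lump_rare_values(data)
print(result)
```

One caveat to keep in mind: `value_counts()` drops missing values by default, so with this sketch any NaN entries would also end up as `other`.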

answered Sep 30 '22 05:09

ayhan

You could compute the value counts for every column at once (here df is the question's DataFrame, data):

value_counts = df.apply(lambda x: x.value_counts())

         Qu1  Qu2  Qu3
apple    1.0  3.0  1.0
banana   2.0  4.0  NaN
cheese   3.0  NaN  3.0
egg      1.0  NaN  1.0
potato   2.0  NaN  3.0
sausage  NaN  2.0  1.0

Then build a dictionary that will contain the replacements for each column:

from itertools import cycle
replacements = {}
for col, s in value_counts.items():
    if s[s<2].any():
        replacements[col] = dict(zip(s[s < 2].index.tolist(), cycle(['other'])))

replacements
{'Qu1': {'egg': 'other', 'apple': 'other'}, 'Qu3': {'egg': 'other', 'apple': 'other', 'sausage': 'other'}}

Use the dictionary to replace the values:

df.replace(replacements)

      Qu1      Qu2     Qu3
0   other  sausage   other
1  potato   banana  potato
2  cheese    apple   other
3  banana    apple  cheese
4  cheese    apple  cheese
5  banana  sausage  potato
6  cheese   banana  cheese
7  potato   banana  potato
8   other   banana   other

or wrap the loop in a dictionary comprehension:

from itertools import cycle

df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})
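As a possible simplification, the `zip`/`cycle` pair can be swapped for `dict.fromkeys`, which maps every key to the same value. The sketch below is self-contained and assumes `df` holds the question's data; it is an alternative spelling, not the code from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

# One column of counts per original column; NaN where a value never occurs.
value_counts = df.apply(lambda col: col.value_counts())

# For each column that has rare values, map every rare value to "other".
# Comparisons against NaN are False, so absent values are never selected.
replacements = {
    col: dict.fromkeys(s[s < 2].index, "other")
    for col, s in value_counts.items()
    if (s < 2).any()
}

result = df.replace(replacements)
print(result)
```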

However, this is not only more cumbersome but also slower than using .where. Testing with 3,000 columns:

df = pd.concat([df for i in range(1000)], axis=1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Columns: 3000 entries, Qu1 to Qu3
dtypes: object(3000)

Using .replace():

%%timeit
value_counts = df.apply(lambda x: x.value_counts())
df.replace({col: dict(zip(s[s < 2].index.tolist(), cycle(['other']))) for col, s in value_counts.items() if s[s < 2].any()})

1 loop, best of 3: 4.97 s per loop

vs .where():

%%timeit
df.where(df.apply(lambda x: x.map(x.value_counts()))>=2, "other")

1 loop, best of 3: 2.01 s per loop
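Outside IPython, a comparable measurement can be sketched with the standard `timeit` module. The helper names `with_where` and `with_replace` are hypothetical, and this version deliberately differs from the test above: it uses 300 columns and gives them unique names via `add_suffix`. Absolute timings depend on the machine and pandas version, so no expected numbers are shown.

```python
import timeit
from itertools import cycle

import pandas as pd

base = pd.DataFrame({
    'Qu1': ['apple', 'potato', 'cheese', 'banana', 'cheese', 'banana', 'cheese', 'potato', 'egg'],
    'Qu2': ['sausage', 'banana', 'apple', 'apple', 'apple', 'sausage', 'banana', 'banana', 'banana'],
    'Qu3': ['apple', 'potato', 'sausage', 'cheese', 'cheese', 'potato', 'cheese', 'potato', 'egg'],
})

# Assumption: 100 copies with unique column names (300 columns in total).
wide = pd.concat([base.add_suffix(f"_{i}") for i in range(100)], axis=1)


def with_where(frame):
    # Mask built from per-entry counts, then a single where() call.
    counts = frame.apply(lambda col: col.map(col.value_counts()))
    return frame.where(counts >= 2, "other")


def with_replace(frame):
    # Per-column replacement dictionaries, then a single replace() call.
    value_counts = frame.apply(lambda col: col.value_counts())
    return frame.replace({
        col: dict(zip(s[s < 2].index.tolist(), cycle(['other'])))
        for col, s in value_counts.items()
        if s[s < 2].any()
    })


for fn in (with_where, with_replace):
    seconds = timeit.timeit(lambda: fn(wide), number=3)
    print(f"{fn.__name__}: {seconds / 3:.3f} s per call")
```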

answered Sep 30 '22 06:09

Stefan
