I am trying to perform some linear regression analysis. I have some categorical features that I convert to dummy variables with the super awesome get_dummies.
The issue I face is that the DataFrame gets too big when I add dummy columns for every level of every category.
Is there a way (using get_dummies or a more elaborate method) to create dummy variables only for the most frequent values instead of all of them?
get_dummies() converts categorical data into dummy (indicator) variables.
The drop_first parameter specifies whether to drop the first level of the categorical variable you're encoding. It defaults to drop_first=False, so get_dummies creates one dummy variable for every level of the input categorical variable.
To avoid the dummy variable trap, we drop one of the columns created when the categorical variables are converted to dummy variables by one-hot encoding. This is safe because the dummy variables carry redundant information: the dropped level is implied whenever all the remaining dummies are zero.
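For illustration, here is a minimal sketch of drop_first (the "color" Series and its values are made up for this example, not taken from the question):

import pandas as pd

s = pd.Series(["red", "green", "blue", "green"], name="color")

# one dummy column per level (default behaviour)
print(pd.get_dummies(s, prefix=s.name))

# first level dropped: the omitted level is implied by an all-zero row
print(pd.get_dummies(s, prefix=s.name, drop_first=True))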
(1) get_dummies can't natively handle unknown categories (values that only show up at transform/prediction time); you have to apply workarounds yourself, which is not efficient. scikit-learn's OneHotEncoder, on the other hand, handles unknown categories natively.
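A rough sketch of that, assuming a reasonably recent scikit-learn is available (the "city" column and both frames are invented purely for illustration):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical training and scoring frames, for illustration only
train = pd.DataFrame({"city": ["NY", "NY", "LA", "SF", "SF", "SF"]})
test = pd.DataFrame({"city": ["SF", "Boston"]})   # "Boston" never appears in train

# handle_unknown="ignore": unseen categories at transform time are encoded
# as all zeros instead of raising an error
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["city"]])

print(enc.get_feature_names_out())
print(enc.transform(test[["city"]]).toarray())

# In scikit-learn >= 1.1, OneHotEncoder also accepts min_frequency / max_categories,
# which group rare levels into an "infrequent" bucket, much like the 'others'
# column produced by the function below.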
I used the answer that @HYRY gave to write a function with a threshold parameter that separates the popular values from the unpopular ones (the latter get combined into an 'others' column).
import pandas as pd
import numpy as np
# func that returns a dummified DataFrame of the significant dummies in a given column;
# everything at or below the threshold is collapsed into an 'others' column
def dum_sign(dummy_col, threshold=0.1):
    # work on a copy so the original Series is not modified
    dummy_col = dummy_col.copy()

    # ratio of each value in the whole column (NaN counts toward the denominator)
    count = dummy_col.value_counts() / len(dummy_col)

    # mark values whose ratio is higher than the threshold
    # (NaN is never in count.index, so it also fails this test)
    mask = dummy_col.isin(count[count > threshold].index)

    # replace values whose ratio is at or below the threshold by a special name
    dummy_col[~mask] = "others"

    return pd.get_dummies(dummy_col, prefix=dummy_col.name)
Let's create some data:
df = ['a', 'a', np.nan, np.nan, 'a', np.nan, 'a', 'b', 'b', 'b', 'b', 'b',
'c', 'c', 'd', 'e', 'g', 'g', 'g', 'g']
data = pd.Series(df, name='dums')
Examples of use:
In: dum_sign(data)
Out:
dums_a dums_b dums_g dums_others
0 1 0 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 0 1
4 1 0 0 0
5 0 0 0 1
6 1 0 0 0
7 0 1 0 0
8 0 1 0 0
9 0 1 0 0
10 0 1 0 0
11 0 1 0 0
12 0 0 0 1
13 0 0 0 1
14 0 0 0 1
15 0 0 0 1
16 0 0 1 0
17 0 0 1 0
18 0 0 1 0
19 0 0 1 0
In: dum_sign(data, threshold=0.2)
Out:
dums_b dums_others
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
5 0 1
6 0 1
7 1 0
8 1 0
9 1 0
10 1 0
11 1 0
12 0 1
13 0 1
14 0 1
15 0 1
16 0 1
17 0 1
18 0 1
19 0 1
In: dum_sign(data, threshold=0)
Out:
dums_a dums_b dums_c dums_d dums_e dums_g dums_others
0 1 0 0 0 0 0 0
1 1 0 0 0 0 0 0
2 0 0 0 0 0 0 1
3 0 0 0 0 0 0 1
4 1 0 0 0 0 0 0
5 0 0 0 0 0 0 1
6 1 0 0 0 0 0 0
7 0 1 0 0 0 0 0
8 0 1 0 0 0 0 0
9 0 1 0 0 0 0 0
10 0 1 0 0 0 0 0
11 0 1 0 0 0 0 0
12 0 0 1 0 0 0 0
13 0 0 1 0 0 0 0
14 0 0 0 1 0 0 0
15 0 0 0 0 1 0 0
16 0 0 0 0 0 1 0
17 0 0 0 0 0 1 0
18 0 0 0 0 0 1 0
19 0 0 0 0 0 1 0
Any suggestions on how to handle NaNs? I believe that NaNs should not be treated as 'others'.
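One possible tweak (just a sketch, not part of the original function): only relabel non-null values, so NaN stays out of the 'others' bucket. get_dummies skips NaN by default, so those rows end up all zeros; pass dummy_na=True instead if you want an explicit NaN column.

def dum_sign_keep_nan(dummy_col, threshold=0.1):
    # variant of dum_sign that leaves NaN untouched instead of folding it into "others"
    dummy_col = dummy_col.copy()
    count = dummy_col.value_counts() / len(dummy_col)
    mask = dummy_col.isin(count[count > threshold].index)
    # only non-null infrequent values are renamed; NaN rows become all-zero dummies
    dummy_col[~mask & dummy_col.notna()] = "others"
    # pass dummy_na=True here instead if NaN should get its own indicator column
    return pd.get_dummies(dummy_col, prefix=dummy_col.name)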
UPD: I have tested it on a pretty large dataset (5 million observations) with 183 distinct strings in the column I wanted to dummify. The implementation takes 10 seconds at most on my laptop.
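If memory (the "DataFrame gets too big" part of the question) is the main concern, the dummies can also be stored as sparse columns. A small sketch, reusing the data Series from above:

# sparse=True stores the indicator columns as pandas sparse columns,
# which can cut memory use drastically when most entries are zero
sparse_dummies = pd.get_dummies(data, prefix=data.name, sparse=True)
print(sparse_dummies.dtypes)                          # sparse dtype (exact type varies by pandas version)
print(sparse_dummies.memory_usage(deep=True).sum())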