Drop columns with low standard deviation in Pandas Dataframe

Tags: python, pandas

Is there a way to drop the columns of a DataFrame that have a low standard deviation without writing a for loop?

Suppose we have the following data:

import pandas as pd
d = {'A': {-1: 0.19052041339798062,
      0: -0.0052531481871952871,
      1: -0.0022017467720961644,
      2: -0.051109629013311737,
      3: 0.18569441222621336},
     'B': {-1: 0.029181417300734112,
      0: -0.0031021862533310743,
      1: -0.014358516787430284,
      2: 0.0046386615308068877,
      3: 0.056676322314857898},
     'C': {-1: 0.071883343375205785,
      0: -0.011930096520251999,
      1: -0.011836365865654104,
      2: -0.0033930358388315237,
      3: 0.11812543193496111},
     'D': {-1: 0.17670604006475121,
      0: -0.088756293654161142,
      1: -0.093383245649534194,
      2: 0.095649943383654359,
      3: 0.51030339029516592},
     'E': {-1: 0.30273513342295627,
      0: -0.30640233455497284,
      1: -0.32698263145105921,
      2: 0.60257484810641992,
      3: 0.36859978928328413},
     'F': {-1: 0.25328469046380131,
      0: -0.063890702001567143,
      1: -0.10007720832198815,
      2: 0.08153164759036724,
      3: 0.36606175240021183},
     'G': {-1: 0.28764606940509913,
      0: -0.11022209861109525,
      1: -0.1264164305949009,
      2: 0.17030074112227081,
      3: 0.30100292424380881}}
df = pd.DataFrame(d)

I know I can get the std values by std_vals = df.std(), which gives the following result, and use these values to drop the columns one by one.

In[]:
        pd.DataFrame(d).std()
Out[]:
        A    0.115374
        B    0.028435
        C    0.059394
        D    0.247617
        E    0.421117
        F    0.200776
        G    0.209710
        dtype: float64

However, I don't know how to use the Pandas indexing to drop the columns with low std values directly.

Is there a way to do this, or do I need to loop over each column?


asked Aug 04 '15 01:08

Ashkan

1 Answer

You can use the loc indexer of a DataFrame to select columns with a Boolean indexer. Create the indexer like this (df.std() returns a Series with one value per column, and the comparison is applied element-wise to that Series):

df.std() > 0.3

Out[84]: 
A    False
B    False
C    False
D    False
E     True
F    False
G    False
dtype: bool
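
To make the shape of that indexer explicit, here is a minimal sketch on made-up data (the toy frame, its column names, and the threshold variable are hypothetical and not part of the question). It shows that the comparison yields a Boolean Series labelled by column name, which is what lets loc line it up with the columns.

```python
import pandas as pd

# Hypothetical toy data, not the data from the question.
toy = pd.DataFrame({
    "flat": [1.0, 1.0, 1.0, 1.0],    # constant column, std is 0
    "wide": [0.0, 10.0, -10.0, 5.0],  # spread-out column, std well above 0.3
})

threshold = 0.3        # assumed cut-off, same value the answer uses
stds = toy.std()       # one standard deviation per column
mask = stds > threshold  # Boolean Series indexed by column name

print(mask)
```

Because the mask is indexed by column label, it can be stored in a variable and reused rather than recomputed inside the indexing expression.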

Then call loc with : in the first position to indicate that you want to return all rows:

df.loc[:, df.std() > .3]
Out[85]: 
           E
-1  0.302735
 0 -0.306402
 1 -0.326983
 2  0.602575
 3  0.368600
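
If you prefer to think of this as dropping columns rather than selecting them, the same mask can be inverted and passed to DataFrame.drop. The following is only a sketch on hypothetical toy data (the frame, column names, and threshold are made up for illustration); it is an equivalent formulation, not a different method from the one in the answer.

```python
import pandas as pd

# Hypothetical toy data, not the data from the question.
toy = pd.DataFrame({
    "flat":   [1.0, 1.0, 1.0, 1.0],
    "narrow": [0.1, 0.2, 0.1, 0.2],
    "wide":   [0.0, 10.0, -10.0, 5.0],
})

threshold = 0.3
mask = toy.std() > threshold  # True for columns to keep

# Selection, as in the answer: keep all rows, keep columns where mask is True.
kept_by_loc = toy.loc[:, mask]

# Dropping: collect the labels where the mask is False and drop them.
low_std_cols = mask.index[~mask]
kept_by_drop = toy.drop(columns=low_std_cols)

print(kept_by_loc.columns.tolist())
print(kept_by_drop.columns.tolist())
```

Both variants return a new DataFrame and leave the original untouched, so assign the result back if you want to replace it.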

answered Sep 26 '22 00:09

maxymoo
