sum values of columns starting with the same string in pandas dataframe

Tags: python, pandas, dataframe, startswith

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56            1            1          0          1             0            0
1  11            0            0          0          0             1            0
2   6            0            0          1          0             0            1
3  43            0            0          0          1             0            1
4  14            0            1          0          0             1            0

   Histo  Economics-51  Literature-re  Literatureu4
0      1             0              1             0
1      0             0              0             1
2      0             0              0             0
3      0             1              1             0
4      1             0              0             0

My goal is to keep only the global categories (Economics, English, History, Literature) and replace their component columns with the sum of those components. For instance, "English" would be the sum of "English-107" and "English-2":

   Id  Economics  English  History  Literature
0  56          1        1        2           1
1  11          1        0        0           1
2   6          0        1        1           0
3  43          2        0        1           1
4  14          0        1        1           0

For this purpose, I have tried two methods. First method:

df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]

Second method:

df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]

However, both give the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

My question is: how can I debug this error, or is there another solution to my problem? Note that the dataframe is rather large, about 100 columns and 400000 rows, so I'm looking for an optimized solution, such as using loc in pandas.

asked Mar 02 '16 12:03

Amanda

1 Answer

You can use this to create the sum of all columns whose names start with a specific string:

# '^' anchors the pattern so only names that start with 'Economics' match
df['Economics'] = df.filter(regex='^Economics').sum(axis=1)
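To cover every category in one pass, the same idea can be wrapped in a small helper. This is only a sketch: `sum_by_prefix`, the `prefixes` list and the `sample` frame are made up for illustration, and it assumes each component column really starts with its category name (a truncated name such as "Histo" would not match "History" and would need to be handled separately). Note also that the keyword is `axis`, not `axes` as in the question's attempts.

```python
import pandas as pd


def sum_by_prefix(df, prefixes, id_col="Id"):
    """Return a new frame with id_col plus one summed column per prefix.

    Hypothetical helper, not part of pandas.
    """
    out = df[[id_col]].copy()
    for prefix in prefixes:
        # Pick the columns whose name starts with the prefix, skipping the id column.
        cols = [c for c in df.columns if c != id_col and c.startswith(prefix)]
        # Row-wise sum of the selected columns.
        out[prefix] = df[cols].sum(axis=1)
    return out


# Small made-up sample modelled on the question's layout.
sample = pd.DataFrame({
    "Id": [56, 11],
    "Economics-1": [1, 0],
    "English-107": [1, 0],
    "English-2": [0, 0],
    "History-3": [1, 0],
    "Economics-zz": [0, 1],
    "Literatureu4": [0, 1],
})

result = sum_by_prefix(sample, ["Economics", "English", "History", "Literature"])
print(result)
```

Because this builds a new frame instead of adding columns to the original, re-running it does not pick up the freshly created "Economics" column as one of its own components.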

answered Oct 13 '22 18:10

Raghul
