
sum values of columns starting with the same string in pandas dataframe

I have a dataframe with about 100 columns that looks like this:

   Id  Economics-1  English-107  English-2  History-3  Economics-zz  Economics-2  \
0  56          1            1          0        1       0           0   
1  11          0            0          0        0       1           0   
2   6          0            0          1        0       0           1   
3  43          0            0          0        1       0           1   
4  14          0            1          0        0       1           0   

   Histo      Economics-51      Literature-re         Literatureu4  
0           1            0           1                0  
1           0            0           0                1  
2           0            0           0                0  
3           0            1           1                0  
4           1            0           0                0  

My goal is to keep only the global categories -- English, History, Literature -- and store, for each of them, the sum of the values of its component columns in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":

    Id  Economics      English    History  Literature  
0  56          1            1          2        1                     
1  11          1            0          0        1                    
2   6          0            1          1        0                     
3  43          2            0          1        1                     
4  14          0            1          1        0          

For this purpose, I have tried two methods. First method:

df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]

Second method:

df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]

However, both give the error:

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

My question is: how can I debug this error, or is there another solution to my problem? Note that I have a rather large dataframe with about 100 columns and 400000 rows, so I'm looking for an optimized solution, like using loc in pandas.

Amanda asked Mar 02 '16 12:03




1 Answer

You can use this to create the sum of the columns whose names start with a specific string:

# '^' anchors the match to the start of the column name
df['Economics'] = df[list(df.filter(regex='^Economics'))].sum(axis=1)
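The one-liner above handles a single category. To build all the category columns at once, something along these lines might work. This is only a sketch: the small sample frame and the `categories` list are assumptions made for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical sample mirroring a few columns from the question.
df = pd.DataFrame({
    "Id": [56, 11, 6],
    "Economics-1": [1, 0, 0],
    "Economics-zz": [0, 1, 0],
    "English-107": [1, 0, 0],
    "English-2": [0, 0, 1],
    "History-3": [1, 0, 0],
})

# Assumed list of category prefixes; adjust it to the real data.
categories = ["Economics", "English", "History"]

# Start from the Id column and add one summed column per category.
result = df[["Id"]].copy()
for cat in categories:
    # Pick the columns whose name starts with the prefix,
    # then sum them across each row (axis=1).
    cols = [c for c in df.columns if c.startswith(cat)]
    result[cat] = df[cols].sum(axis=1)

print(result)
```

Using `str.startswith` only matches the beginning of the column name, and writing into a separate `result` frame means the new "Economics" column is never picked up as one of its own components if the code is run again.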
Raghul answered Oct 13 '22 18:10