
Pandas dataframe computations

I am trying to compute a metric with pandas DataFrames. In particular, I get predictions from a results object:

prediction = results.predict(start=1,end=len(test),exog=test)

The actual values are in a DataFrame column:

test['actual']

I need to compute two things:

  1. How can I compute the sum of squared errors? Basically, I would subtract the actual values from the predictions element by element and then sum the squares of the differences.

  2. How can I compute the sum of squares of the predicted minus the mean of the actual values? So it would be

    (x1-mean_actual)^2 + (x2-mean_actual)^2...+(xn-mean_actual)^2
    
user1802143 asked Nov 26 '13 06:11



1 Answer

The first one (the sum of squared errors) would be:

((prediction - test['actual']) ** 2).sum()

The second one (the sum of squared deviations of the predictions from the mean of the actual values) would be:

((prediction - test['actual'].mean()) ** 2).sum()
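
As a rough check, here is a minimal sketch of both expressions on made-up numbers. The toy `test` frame, the `prediction` values, and the `shifted` series are assumptions for illustration only; they are not the asker's data. One thing to watch: pandas aligns two Series on their index labels before subtracting, so if `prediction` and `test['actual']` carry different indexes (the question does not show whether they do), the first expression silently skips the unmatched rows. Comparing the underlying arrays by position avoids that.

```python
import pandas as pd

# Hypothetical toy data standing in for test['actual'] and the model output.
test = pd.DataFrame({"actual": [3.0, 5.0, 7.0, 9.0]})
prediction = pd.Series([2.5, 5.5, 6.0, 9.0], index=test.index)

# 1. Sum of squared errors: element-wise difference, squared, then summed.
sse = ((prediction - test["actual"]) ** 2).sum()

# 2. Sum of squared deviations of the predictions from the mean of the actuals.
mean_actual = test["actual"].mean()
ss_pred = ((prediction - mean_actual) ** 2).sum()

# Index pitfall: the same predictions labelled 1..4 instead of 0..3.
shifted = pd.Series(prediction.to_numpy(), index=range(1, len(test) + 1))

# Label alignment pairs the wrong rows and drops the unmatched ones as NaN.
sse_misaligned = ((shifted - test["actual"]) ** 2).sum()

# Working on the raw arrays compares values purely by position.
sse_positional = ((shifted.to_numpy() - test["actual"].to_numpy()) ** 2).sum()

print(sse, ss_pred, sse_misaligned, sse_positional)
```

If the two indexes already match, the one-liners above can be used as they are; the positional version only matters when the labels differ but the row order is known to correspond.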
Roman Pekar answered Oct 15 '22 17:10