
What is the difference between interpolation and imputation?

I just learned that you can handle missing data (NaN) with imputation and interpolation. From what I found, interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points, while imputation replaces the missing data with the mean of the column. But are there any differences beyond that? When is it best practice to use each of them?

random student asked Nov 06 '19 13:11


2 Answers

Interpolation

Linear interpolation draws a straight line between two known points and reads the missing values in between off that line:

  • Two red points are known
  • Blue point is missing

source: wikipedia


OK, nice explanation, but show me with data.

First of all, the formula for linear interpolation at x between two known points (x0, y0) and (x1, y1) is:

y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)

Let's say we have the three data points from the graph above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0, np.nan, 3]})

   Value
0    0.0
1    NaN
2    3.0

As we can see, row 1 (the blue point) is missing. Linear interpolation at x = 1 between the known points (0, 0) and (2, 3) gives:

0 + (1 - 0) * (3 - 0) / (2 - 0) = 1.5

Interpolating with the pandas method Series.interpolate gives the same result:

df['Value'].interpolate()

0    0.0
1    1.5
2    3.0
Name: Value, dtype: float64
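
As a cross-check, the straight-line formula y = y0 + (x - x0) * (y1 - y0) / (x1 - x0) can be written out by hand and compared with what pandas returns. This is only a minimal sketch: the helper name `lerp` is made up for illustration, and it assumes the row index plays the role of x.

```python
import numpy as np
import pandas as pd


def lerp(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1).

    Hypothetical helper for illustration; not part of pandas.
    """
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


s = pd.Series([0, np.nan, 3], name="Value")

# The row index is used as x: known points are (0, 0.0) and (2, 3.0).
manual = lerp(1, 0, s.loc[0], 2, s.loc[2])

# pandas' default method is linear interpolation over the row positions.
filled = s.interpolate()

print(manual)         # hand-computed value for row 1
print(filled.loc[1])  # value pandas filled in for row 1
```

Both printed values should agree for this three-row example.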

For a bigger dataset it would look as follows:

df = pd.DataFrame({'Value': [1, np.nan, 4, np.nan, np.nan, 7]})

   Value
0    1.0
1    NaN
2    4.0
3    NaN
4    NaN
5    7.0

df['Value'].interpolate()

0    1.0
1    2.5
2    4.0
3    5.0
4    6.0
5    7.0
Name: Value, dtype: float64

Imputation

When we impute the data with the (arithmetic) mean, every missing value is replaced by the mean of the observed (non-missing) values:

sum(observed points) / n

where n is the number of observed points.

So for our second dataframe we get:

(1 + 4 + 7) / 3 = 4

So if we impute our dataframe with Series.fillna and Series.mean (which skips NaN by default), we get:

df['Value'].fillna(df['Value'].mean())

0    1.0
1    4.0
2    4.0
3    4.0
4    4.0
5    7.0
Name: Value, dtype: float64
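
To see the difference at a glance, both fills can be put side by side on the same six-row series. This is just a sketch; the column names `original`, `interpolated` and `mean_imputed` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, 4, np.nan, np.nan, 7], name="Value")

comparison = pd.DataFrame({
    "original": s,
    # Each gap is filled along the line between its nearest known neighbours.
    "interpolated": s.interpolate(),
    # Every gap gets the same number: the mean of the observed values.
    "mean_imputed": s.fillna(s.mean()),
})

print(comparison)
```

Interpolation uses the position of each gap, so the filled values differ from row to row, while mean imputation ignores position and puts the same value everywhere.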
Erfan answered Nov 15 '22 08:11


I will answer the second part of your question, i.e. when to use which. Both techniques are used, depending on the use case.

Imputation: Suppose you are given a dataset of patients with a disease (say, pneumonia) that includes a feature called body temperature. If this feature has null values, you can replace them with the average value of the feature, i.e. imputation.
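
A minimal sketch of the patient case, using made-up data: the column names `patient_id` and `body_temp_c` and all values are hypothetical. The rows are separate patients with no natural order, so there are no "neighbouring" readings to draw a line between, and the column mean is a simple fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records; one temperature reading is missing.
patients = pd.DataFrame({
    "patient_id": [101, 102, 103, 104],
    "body_temp_c": [38.0, np.nan, 39.0, 37.0],
})

# Mean of the observed readings (Series.mean skips NaN by default).
mean_temp = patients["body_temp_c"].mean()

# Replace the missing reading with that mean.
patients["body_temp_c"] = patients["body_temp_c"].fillna(mean_temp)

print(patients)
```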

Interpolation: Suppose you are given a dataset of a company's share price. Markets are closed every Saturday and Sunday, so those days are missing values. Because the data is ordered in time, these values can be filled in from the Friday and Monday values (for example, along a straight line between them), i.e. interpolation.
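
A minimal sketch of the share-price case, with made-up prices for one weekend: the name `close` and the values 100 and 103 are hypothetical. It assumes the series has a daily DatetimeIndex and uses pandas' `method="time"`, which weights by the time gap between rows.

```python
import numpy as np
import pandas as pd

# Friday 2019-11-01 through Monday 2019-11-04; the weekend has no price.
dates = pd.date_range("2019-11-01", "2019-11-04", freq="D")
close = pd.Series([100.0, np.nan, np.nan, 103.0], index=dates, name="close")

# Fill the weekend along the straight line from Friday to Monday.
filled = close.interpolate(method="time")

# For contrast: the plain average of the Friday and Monday values.
friday_monday_avg = (close.iloc[0] + close.iloc[-1]) / 2

print(filled)             # Saturday is about 101, Sunday about 102
print(friday_monday_avg)  # 101.5
```

With two missing days, the interpolated values sit one third and two thirds of the way from Friday to Monday, so they are not both equal to the plain Friday/Monday average.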

So, you can choose the technique depending upon the use case.

Darshan Jain answered Nov 15 '22 08:11