What is the difference between interpolation and imputation?

Question

I just learned that you can handle missing data (NaN) with imputation and interpolation. What I found is that interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points, while imputation is replacing the missing data with the mean of the column. But are there any differences beyond that? When is it best practice to use each of them?

Erfan · Accepted Answer

Interpolation

Linear interpolation draws a straight line between two known points and reads the missing values in between off that line:

[Figure: the two red points are known; the blue point between them is missing. Source: Wikipedia]

OK, nice explanation, but show me with data.

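First of all, the formula for linear interpolation between two known points (x0, y0) and (x1, y1) is the following:

y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)

Let's say we have the three data points from the graph above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [0, np.nan, 3]})

   Value
0    0.0
1    NaN
2    3.0

As we can see, row 1 (the blue point) is missing. Plugging x = 1, (x0, y0) = (0, 0) and (x1, y1) = (2, 3) into the formula above:

0 + (1 - 0) * (3 - 0) / (2 - 0) = 1.5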

If we interpolate these using the pandas method Series.interpolate:

df['Value'].interpolate()

0    0.0
1    1.5
2    3.0
Name: Value, dtype: float64
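
To double-check the result, here is a minimal sketch that computes the same value three ways: by hand, with np.interp, and with Series.interpolate. The helper name linear_interpolate is made up for this example and is not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd


def linear_interpolate(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + (x - x0) * slope


# Row 1 sits between row 0 (value 0) and row 2 (value 3).
manual = linear_interpolate(1, x0=0, y0=0.0, x1=2, y1=3.0)
with_numpy = float(np.interp(1, [0, 2], [0.0, 3.0]))
with_pandas = pd.Series([0, np.nan, 3]).interpolate().iloc[1]

print(manual, with_numpy, with_pandas)  # 1.5 1.5 1.5
```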

For a bigger dataset it would look as follows:

df = pd.DataFrame({'Value': [1, np.nan, 4, np.nan, np.nan, 7]})

   Value
0    1.0
1    NaN
2    4.0
3    NaN
4    NaN
5    7.0

df['Value'].interpolate()

0    1.0
1    2.5
2    4.0
3    5.0
4    6.0
5    7.0
Name: Value, dtype: float64
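
Rows 3 and 4 come out as 5.0 and 6.0 because the default method='linear' ignores the index and treats the rows as equally spaced. If the index carries real positions, method='index' uses it as the x axis instead. A small sketch, with an index made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical series: the gap sits at x=1, much closer to x=0 than to x=10.
s = pd.Series([1.0, np.nan, 4.0], index=[0, 1, 10])

# Default: rows are treated as equally spaced, so the gap is the midpoint.
equally_spaced = s.interpolate()

# Index-aware: the index values are used as x coordinates.
index_aware = s.interpolate(method='index')

print(equally_spaced.iloc[1])  # 2.5
print(index_aware.iloc[1])     # roughly 1.3
```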

Imputation

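Imputation is the general term for replacing missing values with substitute values; filling with the column mean is one common choice. When we impute the data with the (arithmetic) mean, we use the following formula:

sum(all non-missing points) / n, where n is the number of non-missing points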

So for our second dataframe we get:

(1 + 4 + 7) / 3 = 4

So if we impute our dataframe with Series.fillna and Series.mean:

df['Value'].fillna(df['Value'].mean())

0    1.0
1    4.0
2    4.0
3    4.0
4    4.0
5    7.0
Name: Value, dtype: float64
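
The mean is not the only possible fill value. As a rough sketch, here are the mean, the median, and a forward fill side by side. The series below is the second dataframe's column with the last value changed to 100, purely to make the mean and the median differ.

```python
import numpy as np
import pandas as pd

# Illustrative data only: an outlier (100) pulls the mean away from the median.
s = pd.Series([1, np.nan, 4, np.nan, np.nan, 100], name='Value')

mean_filled = s.fillna(s.mean())      # gaps become 35.0
median_filled = s.fillna(s.median())  # gaps become 4.0
ffilled = s.ffill()                   # each gap copies the last known value

print(pd.DataFrame({
    'mean': mean_filled,
    'median': median_filled,
    'ffill': ffilled,
}))
```

Unlike interpolation, the mean and median fills give every gap the same value, regardless of where the gap sits in the series.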

Darshan Jain · Answer

I will answer the second part of your question, i.e. when to use which. Both techniques are used, depending on the use case.

Imputation: Suppose you are given a dataset of patients with a disease (say, pneumonia) that has a feature called body temperature. The rows have no natural order, so there are no neighbouring values to interpolate between. If this feature has null values, you can replace them with the average value, i.e. imputation.
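
A small sketch of why this matters, using made-up patient data (the column names and values are hypothetical): the interpolated value depends on which rows happen to be adjacent, while the mean does not.

```python
import numpy as np
import pandas as pd

# Hypothetical patients; patient 102 has no recorded temperature.
patients = pd.DataFrame({
    'patient_id': [101, 102, 103, 104],
    'body_temp': [38.5, np.nan, 39.1, 38.0],
})
# Same patients, different (equally arbitrary) row order.
reordered = patients.iloc[[0, 3, 1, 2]]


def fill_for_patient_102(frame, how):
    """Return the filled temperature of patient 102 using the given strategy."""
    temps = frame['body_temp']
    if how == 'interpolate':
        filled = temps.interpolate()
    else:
        filled = temps.fillna(temps.mean())
    return filled[frame['patient_id'] == 102].iloc[0]


# Interpolation changes with row order: about 38.8 versus about 38.55.
print(fill_for_patient_102(patients, 'interpolate'))
print(fill_for_patient_102(reordered, 'interpolate'))

# Mean imputation gives about 38.53 in both cases.
print(fill_for_patient_102(patients, 'mean'))
print(fill_for_patient_102(reordered, 'mean'))
```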

Interpolation: Suppose you are given a dataset of a company's share price. The market is closed every Saturday and Sunday, so those values are missing. Because the data is ordered in time, these values can be filled with points on the straight line between the Friday value and the Monday value, i.e. interpolation.
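
As a sketch of the share price case, assuming made-up prices and a DatetimeIndex: asfreq('D') adds the weekend days as NaN, and interpolate(method='time') fills them according to the elapsed time.

```python
import pandas as pd

# Hypothetical closing prices for a Friday and the following Monday.
prices = pd.Series(
    [100.0, 103.0],
    index=pd.to_datetime(['2024-01-05', '2024-01-08']),
)

# Reindex to calendar days: Saturday and Sunday appear as NaN.
daily = prices.asfreq('D')

# Fill the weekend proportionally to the time elapsed since Friday.
filled = daily.interpolate(method='time')

print(filled)  # Saturday is about 101, Sunday about 102
```

Note that neither weekend value is simply the Friday/Monday average (101.5); each one sits at its own position on the line.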

So, you can choose the technique depending upon the use case.

Tags: python-3.x, pandas

random student
