
Filling missing values with values from most similar row

I have the following table. Some values are NaN. Assume the columns are highly correlated. Comparing row 0 and row 5, I would say the missing value in col2 of row 5 should be 4.0. The same goes for rows 1 and 4. But for row 6 there is no perfectly matching sample, so I should take the most similar row (in this case row 0) and change the NaN to 3.0. How should I approach this? Is there a pandas function that can do it?

import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

Output:

    col1    col2    col3    col4
0   3.0     4.0     7.0     7.0
1   2.0     3.0     8.0     8.0
2   8.0     6.0     9.0     9.0
3   4.0     NaN     NaN     NaN
4   2.0     3.0     NaN     NaN
5   3.0     NaN     7.0     7.0
6   NaN     5.0     7.0     6.0
MarkAlanFrank asked May 08 '19 16:05




1 Answer

This is a hard question. It involves NumPy broadcasting and groupby + transform. I use first here because first picks up the first non-NaN value in each group.
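To see what the broadcast comparison produces, here is a minimal sketch on a toy array (made-up data, not the table from the question). Entry t[i, j] comes out True when row j agrees with row i in every column where row j is not NaN, so NaN cells in row j act as wildcards:

```python
import numpy as np

# Toy data (hypothetical): row 1 is row 0 with one value missing.
s = np.array([[3.0, 4.0],
              [3.0, np.nan],
              [2.0, 4.0]])

# eq[i, j, k] is True when s[j, k] == s[i, k]; NaN never compares equal.
eq = s == s[:, None]

# NaN cells of row j count as a match, then every column must agree.
t = np.all(eq | np.isnan(s), axis=-1)
print(t)
```

Here t[0, 1] is True (row 1 fits row 0), while t[1, 0] is False because row 0 has a value where row 1 has NaN.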

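df = example.copy()
s = df.to_numpy()
# t[i, j] is True when row j agrees with row i wherever row j is not NaN
t = np.all((s == s[:, None]) | np.isnan(s), axis=-1)
# (i, j) pairs of matching rows; np.nonzero avoids relying on stack() dropping NaN
idx = pd.MultiIndex.from_arrays(np.nonzero(t))
# repeat row j once for every pair it takes part in
df = df.reindex(idx.get_level_values(1))
# label each repeated row with its (i, j) pair so both levels can be grouped
df.index = idx
# first non-NaN value within each group i, then collapse back to one row per j
df.groupby(level=0).transform('first').groupby(level=1).first()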
Out[217]: 
   col1  col2  col3  col4
0   3.0   4.0   7.0   7.0
1   2.0   3.0   8.0   8.0
2   8.0   6.0   9.0   9.0
3   4.0   NaN   NaN   NaN
4   2.0   3.0   8.0   8.0
5   3.0   4.0   7.0   7.0
6   NaN   5.0   7.0   6.0
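
The result above only covers rows that have an exact match, so row 6 keeps its NaN. For the "most similar row" part of the question, one possible approach is a nearest-row fill. The sketch below is only an illustration: fill_from_nearest is a hypothetical helper, not a pandas function, and the distance it uses (mean absolute difference over the columns both rows have) is an assumption that may not suit your data.

```python
import numpy as np
import pandas as pd

def fill_from_nearest(df):
    """Fill each NaN from the most similar row that has a value in that column.

    Similarity (an assumed metric): mean absolute difference over the
    columns that are present in both rows.
    """
    s = df.to_numpy(dtype=float)
    out = s.copy()
    # diff[i, j, k] is NaN whenever row i or row j is missing column k
    diff = np.abs(s[:, None, :] - s[None, :, :])
    shared = (~np.isnan(diff)).sum(axis=-1)        # columns both rows have
    total = np.nansum(diff, axis=-1)
    dist = np.where(shared > 0, total / np.maximum(shared, 1), np.inf)
    np.fill_diagonal(dist, np.inf)                 # a row cannot donate to itself
    for i, k in zip(*np.nonzero(np.isnan(s))):
        # only rows that actually have a value in column k can donate
        cand = np.where(np.isnan(s[:, k]), np.inf, dist[i])
        j = cand.argmin()
        if np.isfinite(cand[j]):
            out[i, k] = s[j, k]                    # copy from the original data
    return pd.DataFrame(out, index=df.index, columns=df.columns)

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})
filled = fill_from_nearest(example)
print(filled)
```

With this metric, col1 of row 6 is filled with 3.0, and rows 4 and 5 get the same values as in the exact-match output. Note that row 3 is filled too, based on the single column it shares with other rows, which may or may not be what you want.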
BENY answered Oct 25 '22 05:10
