I have the following table, in which some values are NaN. Let's assume that the columns are highly correlated. Comparing row 0 and row 5, I can say that the value in col2 of row 5 should be 4.0. The same applies to row 1 and row 4. For row 6, however, there is no perfectly matching sample, so I should take the most similar row (in this case, row 0) and change the NaN to 3.0.
How should I approach this? Is there a pandas function that can do it?
import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})
Output:
    col1    col2    col3    col4
0   3.0     4.0     7.0     7.0
1   2.0     3.0     8.0     8.0
2   8.0     6.0     9.0     9.0
3   4.0     NaN     NaN     NaN
4   2.0     3.0     NaN     NaN
5   3.0     NaN     7.0     7.0
6   NaN     5.0     7.0     6.0
Use the fillna() method: fillna() replaces the missing (NaN) values in a DataFrame or Series with a value you specify. Two of its optional arguments are worth noting. value is the value to insert in place of the missing entries. method lets you fill missing values forward or backward from neighboring entries.
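As a minimal sketch of those two options (the Series below is a made-up toy example, not the question's data): recent pandas versions deprecate the method argument of fillna() in favor of the dedicated ffill() and bfill() methods, so the sketch uses those for the forward and backward fills.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the question's data
s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled_const = s.fillna(0.0)  # replace every NaN with a fixed value
filled_fwd = s.ffill()        # carry the last valid value forward
filled_bwd = s.bfill()        # pull the next valid value backward

print(filled_const.tolist())  # [1.0, 0.0, 0.0, 4.0]
print(filled_fwd.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(filled_bwd.tolist())    # [1.0, 4.0, 4.0, 4.0]
```

Note that none of these looks at the other columns, so on its own it does not solve the matching-row problem from the question.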
Using fillna() to fill values from another column: calling fillna() on "Col1" of the DataFrame df and passing the Series df['Col2'] as the argument fills the missing values in "Col1" with the corresponding values (aligned on the index) from "Col2".
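A short sketch of that call, assuming a small made-up frame with the column names "Col1" and "Col2" used in the description:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only the column names come from the text above
df = pd.DataFrame({"Col1": [1.0, np.nan, 3.0, np.nan],
                   "Col2": [10.0, 20.0, 30.0, np.nan]})

# Missing entries in Col1 take the value from Col2 at the same index label
df["Col1"] = df["Col1"].fillna(df["Col2"])

print(df)
```

Row 1 picks up 20.0 from "Col2", while row 3 stays NaN because "Col2" is missing there as well.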
Extract rows/columns with missing values in specific columns/rows: you can use the isnull() or isna() method of pandas.DataFrame and Series to check whether each element is a missing value. isnull() is an alias for isna(), and both are used the same way.
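For instance, on the question's example frame, a sketch of those checks could look like this (the variable names mask, rows_missing_col2, rows_any_missing and cols_with_missing are just illustrative):

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

mask = example.isna()  # boolean frame, True where a value is missing

rows_missing_col2 = example[example["col2"].isna()]  # rows with NaN in col2
rows_any_missing = example[mask.any(axis=1)]         # rows with any NaN
cols_with_missing = example.columns[mask.any(axis=0)]  # columns with any NaN

print(rows_missing_col2.index.tolist())  # [3, 5]
print(rows_any_missing.index.tolist())   # [3, 4, 5, 6]
print(cols_with_missing.tolist())
```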
This is a hard question. It involves NumPy broadcasting and groupby + transform. I am using first here, since first picks up the first non-NaN value in each group.
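df = example.copy()  # df is the question's DataFrame (named example there)
s = df.values
# t[i, j] is True when every non-NaN value in row j equals the value in row i
t = np.all((s == s[:, None]) | np.isnan(s), -1)
# dropna() keeps only the True pairs even if stack() does not drop NaN itself
idx = pd.DataFrame(t).where(t).stack().dropna().index
# we get the (i, j) pair for each matching row
df = df.reindex(idx.get_level_values(1))
# reorder our df to follow the second level of idx
df.index = idx
# assign the MultiIndex so that both groupby levels exist
df.groupby(level=0).transform('first').groupby(level=1).first()
# two groupby passes with first give us what we need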
Out[217]: 
   col1  col2  col3  col4
0   3.0   4.0   7.0   7.0
1   2.0   3.0   8.0   8.0
2   8.0   6.0   9.0   9.0
3   4.0   NaN   NaN   NaN
4   2.0   3.0   8.0   8.0
5   3.0   4.0   7.0   7.0
6   NaN   5.0   7.0   6.0
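The output above only fills rows that have an exact match, so row 6 keeps its NaN (and so does row 3). For the "most similar row" fallback from the question, here is a rough sketch under my own assumptions, not a built-in pandas feature: the helper name fill_from_most_similar is made up, similarity is taken to be the mean absolute difference over the columns both rows have observed, and ties go to the row that comes first. Unlike the groupby approach, it also fills row 3 from its nearest neighbour, which may or may not be what you want.

```python
import numpy as np
import pandas as pd


def fill_from_most_similar(df):
    """Fill each NaN from the donor row closest on the shared columns.

    Hypothetical helper: distance = mean absolute difference over the
    columns that both rows have observed (an assumption, not a pandas API).
    """
    values = df.to_numpy(dtype=float)
    observed = ~np.isnan(values)
    filled = values.copy()
    for i in range(len(values)):
        missing_cols = np.flatnonzero(~observed[i])
        if missing_cols.size == 0:
            continue
        # Columns observed both in row i and in each candidate row
        shared = observed[i] & observed
        n_shared = shared.sum(axis=1)
        diff = np.where(shared, np.abs(values - values[i]), 0.0)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = diff.sum(axis=1) / n_shared
        dist[n_shared == 0] = np.inf  # nothing to compare on
        dist[i] = np.inf              # a row cannot donate to itself
        for c in missing_cols:
            # Only rows that actually have a value in column c can donate
            candidates = np.where(observed[:, c], dist, np.inf)
            j = int(np.argmin(candidates))
            if np.isfinite(candidates[j]):
                filled[i, c] = values[j, c]
    return pd.DataFrame(filled, index=df.index, columns=df.columns)


example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})
result = fill_from_most_similar(example)
print(result)
```

On the example data this fills col1 of row 6 with 3.0, as the question expects, and reproduces the exact-match fills for rows 4 and 5. Donors are always read from the original values, so the result does not depend on the order in which rows are processed.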