Filling missing values with values from most similar row

Tags: python, pandas, data-science

I have the following table. Some values are NaN. Let's assume that the columns are highly correlated. Comparing row 0 and row 5, I can say that the missing value in col2 of row 5 should be 4.0. The same applies to rows 1 and 4: row 4 should take its col3 and col4 from row 1. But for row 6 there is no perfectly matching row, so I should take the most similar row, in this case row 0, and change the NaN to 3.0. How should I approach this? Is there a pandas function that can do it?

import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

Output:

    col1    col2    col3    col4
0   3.0     4.0     7.0     7.0
1   2.0     3.0     8.0     8.0
2   8.0     6.0     9.0     9.0
3   4.0     NaN     NaN     NaN
4   2.0     3.0     NaN     NaN
5   3.0     NaN     7.0     7.0
6   NaN     5.0     7.0     6.0


asked May 08 '19 16:05

MarkAlanFrank

1 Answer

This is a hard question. It involves NumPy broadcasting plus groupby with transform. I use 'first' here because first picks up the first non-NaN value in each group.
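To make the broadcasting step easier to follow, here is a minimal sketch that builds the same kind of row-compatibility mask on the question's data and lists which rows match. The names eq and pairs are only illustrative and are not part of the answer.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

s = example.to_numpy()

# s[:, None] has shape (7, 1, 4); comparing it with s (7, 4) broadcasts to
# (7, 7, 4), so eq[i, j, k] is True when rows i and j agree in column k.
eq = s == s[:, None]

# t[i, j] is True when every value of row j is either NaN or equal to the
# value of row i, i.e. row i could act as an exact donor for row j.
t = np.all(eq | np.isnan(s), axis=-1)

# List the (donor, target) pairs, skipping each row's match with itself.
pairs = [(int(i), int(j)) for i, j in np.argwhere(t) if i != j]
print(pairs)  # [(0, 5), (1, 4)]
```

Row 0 can fill row 5 and row 1 can fill row 4, which are the two exact matches described in the question.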

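df = example.copy()
s = df.values
# t[i, j] is True when every value in row j is either NaN or equal to row i's value
t = np.all((s == s[:, None]) | np.isnan(s), -1)
# MultiIndex of (donor row i, matching row j) pairs
idx = pd.DataFrame(t).where(t).stack().dropna().index
# repeat each row j once per pair it appears in, then index the result by the pairs
df = df.reindex(idx.get_level_values(1))
df.index = idx
# groupby with first twice: first non-NaN value within each donor group,
# then first non-NaN value for each original row
df.groupby(level=[0]).transform('first').groupby(level=1).first()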
Out[217]: 
   col1  col2  col3  col4
0   3.0   4.0   7.0   7.0
1   2.0   3.0   8.0   8.0
2   8.0   6.0   9.0   9.0
3   4.0   NaN   NaN   NaN
4   2.0   3.0   8.0   8.0
5   3.0   4.0   7.0   7.0
6   NaN   5.0   7.0   6.0
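
The output above still has NaN in rows 3 and 6, because only exact matches are used. For the "most similar row" part of the question, one possible fallback is sketched below. It is an assumption-laden illustration, not a built-in pandas feature: fill_from_nearest_row is a hypothetical helper, and the distance (mean squared difference over the columns both rows have observed) is just one reasonable choice. A different metric may pick a different donor row.

```python
import numpy as np
import pandas as pd


def fill_from_nearest_row(df):
    """Fill each NaN from the closest row that has a value in that column.

    Hypothetical helper: "closest" means the smallest mean squared difference
    over the columns observed in both rows.
    """
    values = df.to_numpy(dtype=float)
    observed = ~np.isnan(values)
    filled = values.copy()

    # Visit only the rows that contain at least one NaN.
    for r in np.flatnonzero(~observed.all(axis=1)):
        # Columns observed in both the target row r and each candidate row.
        shared = observed & observed[r]
        n_shared = shared.sum(axis=1)
        sq_diff = np.where(shared, (values - values[r]) ** 2, 0.0)
        with np.errstate(invalid="ignore", divide="ignore"):
            dist = sq_diff.sum(axis=1) / n_shared
        dist[n_shared == 0] = np.inf  # no common columns to compare on
        dist[r] = np.inf              # a row cannot donate to itself

        for c in np.flatnonzero(~observed[r]):
            # Only rows that actually have a value in column c can donate it.
            candidate = np.where(observed[:, c], dist, np.inf)
            donor = candidate.argmin()
            if np.isfinite(candidate[donor]):
                filled[r, c] = values[donor, c]

    return pd.DataFrame(filled, index=df.index, columns=df.columns)


example = pd.DataFrame({"col1": [3, 2, 8, 4, 2, 3, np.nan],
                        "col2": [4, 3, 6, np.nan, 3, np.nan, 5],
                        "col3": [7, 8, 9, np.nan, np.nan, 7, 7],
                        "col4": [7, 8, 9, np.nan, np.nan, 7, 6]})

result = fill_from_nearest_row(example)
print(result)
```

With this distance, rows 4 and 5 get the same values as in the exact-match output, and row 6 gets col1 = 3.0 as the question expects. Row 3 is also filled (from row 0), which may or may not be desirable when only one column is available for comparison. Distances are always computed on the original data, so the result does not depend on the order in which rows are filled.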


answered Oct 25 '22 05:10

BENY
