Fill Holes with Majority of Surrounding Values (Python)

I use Python and have an array with the values 1.0, 2.0, 3.0, 4.0, 5.0, 6.0 and np.nan as NoData.

I want to fill all "nan" with a value. This value should be the majority of the surrounding values.

For example:

1 1 1 1 1
1 n 1 2 2
1 3 3 2 1
1 3 2 3 1

Here "n" represents "nan". The majority of its neighbors have the value 1, so the "nan" should be replaced by 1.

Note that the holes of "nan" can be 1 to 5 cells in size. For example (maximum size of 5 nan):

1 1 1 1 1
1 n n n 2
1 n n 2 1
1 3 2 3 1

Here the hole of "nan" has the following surrounding values:

surrounding_values = [1,1,1,1,1,2,1,2,3,2,3,1,1,1] -> Majority = 1
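
The tally above can be checked with a few lines of standard-library Python. This is only a sketch that counts the hand-listed neighbor values; it does not locate the hole in the array itself.

```python
from collections import Counter

# Neighbor values of the five-cell hole, copied from the list above.
surrounding_values = [1, 1, 1, 1, 1, 2, 1, 2, 3, 2, 3, 1, 1, 1]

counts = Counter(surrounding_values)
# most_common(1) returns a list with one (value, frequency) pair.
majority, frequency = counts.most_common(1)[0]

print(majority)   # 1
print(frequency)  # 9
```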

I tried the following code:

import numpy as np
# Note: Imputer was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer replaced it.
from sklearn.preprocessing import Imputer

array = np.array(.......)   #consisting of 1.0-6.0 & np.nan
imp = Imputer(strategy="most_frequent")
fill = imp.fit_transform(array)

This works pretty well. However, it only works along one axis (0 = column, 1 = row). The default is 0 (column), so each "nan" gets the most frequent value of its own column. For example:

Array
2 1 2 1 1
2 n 2 2 2
2 1 2 2 1
1 3 2 3 1

Filled Array
2 1 2 1 1
2 1 2 2 2
2 1 2 2 1
1 3 2 3 1

So here you see that although the majority of the surrounding values is 2, the most frequent value in the same column is 1, and thus the "nan" becomes 1 instead of 2.

As a result, I need to find another method using Python. Any suggestions or ideas?
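
The column-wise behavior can be reproduced with plain NumPy. The sketch below is a hand-written stand-in, not scikit-learn's implementation, and `fill_by_column_mode` is a hypothetical helper name introduced only for illustration.

```python
import numpy as np

def fill_by_column_mode(arr):
    """Replace each NaN with the most frequent non-NaN value of its column."""
    filled = arr.copy()
    for col in range(arr.shape[1]):
        column = arr[:, col]
        valid = column[~np.isnan(column)]
        if valid.size == 0:
            continue  # an all-NaN column offers nothing to count
        values, counts = np.unique(valid, return_counts=True)
        filled[np.isnan(column), col] = values[np.argmax(counts)]
    return filled

n = np.nan
array = np.array([[2, 1, 2, 1, 1],
                  [2, n, 2, 2, 2],
                  [2, 1, 2, 2, 1],
                  [1, 3, 2, 3, 1]], dtype=float)

filled = fill_by_column_mode(array)
print(filled[1, 1])  # 1.0, because column 1 holds the values 1, 1 and 3
```

The hole is filled from its column alone, which is why the six neighboring 2s have no influence on the result.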


SUPPLEMENT:

Here is the result after I applied the very helpful improvement by Martin Valgur.

[image]

Think of "0" as sea (blue) and of the other values (> 0) as land (red).

If there is a "little" sea surrounded by land (the sea can again be 1-5 px in size), it becomes land, as you can see in the result image. If the enclosed sea is bigger than 5 px, or lies outside the land, it stays sea (this is not visible in the image, because the case does not occur there).

If there is a 1 px "nan" with more sea than land around it, it should still become land (in this example the split is 50/50).

The following picture shows what I need. At the border between sea (value=0) and land (value>0), the "nan"-pixel needs to get the value of the majority of the land-values.

[image]

That sounds difficult, and I hope I have explained it clearly.

Johannes-R-Schmid asked Jan 09 '17 15:01


2 Answers

A possible solution using label() and binary_dilation() from scipy.ndimage:

import numpy as np
from scipy.ndimage import label, binary_dilation
from collections import Counter

def impute(arr):
    imputed_array = np.copy(arr)

    mask = np.isnan(arr)
    labels, count = label(mask)
    for idx in range(1, count + 1):
        hole = labels == idx
        surrounding_values = arr[binary_dilation(hole) & ~hole]
        most_frequent = Counter(surrounding_values).most_common(1)[0][0]
        imputed_array[hole] = most_frequent

    return imputed_array
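
Both `label()` and `binary_dilation()` default to the four direct neighbors, whereas the hand tally in the question also counts diagonals. With only direct neighbors, the hole in the question's second array sees a 2:2 split between the values 1 and 2. A possible variant, assuming the 8-neighborhood is what is wanted, passes an explicit structuring element to both calls; `impute_8` is a hypothetical name used only for this sketch.

```python
import numpy as np
from collections import Counter
from scipy.ndimage import label, binary_dilation, generate_binary_structure

# Assumption: diagonal neighbors count too, as in the question's hand tally.
EIGHT_CONNECTED = generate_binary_structure(2, 2)  # 3x3 block of True

def impute_8(arr):
    """Hypothetical variant of impute() that uses the 8-neighborhood."""
    imputed = np.copy(arr)
    labels, count = label(np.isnan(arr), structure=EIGHT_CONNECTED)
    for idx in range(1, count + 1):
        hole = labels == idx
        # Cells touching the hole (including diagonally) but not part of it.
        ring = binary_dilation(hole, structure=EIGHT_CONNECTED) & ~hole
        if not ring.any():
            continue  # the whole array is NaN: nothing to vote with
        most_frequent = Counter(arr[ring].tolist()).most_common(1)[0][0]
        imputed[hole] = most_frequent
    return imputed

n = np.nan
single = np.array([[2, 1, 2, 1, 1],
                   [2, n, 2, 2, 2],
                   [2, 1, 2, 2, 1],
                   [1, 3, 2, 3, 1]], dtype=float)
five = np.array([[1, 1, 1, 1, 1],
                 [1, n, n, n, 2],
                 [1, n, n, 2, 1],
                 [1, 3, 2, 3, 1]], dtype=float)

filled_single = impute_8(single)
filled_five = impute_8(five)
print(filled_single[1, 1])  # 2.0: six of the eight neighbors are 2
print(filled_five)          # the five-cell hole is filled with 1.0
```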

EDIT: Regarding your loosely-related follow-up question, you can extend the above code to achieve what you are after:

import numpy as np
from scipy.ndimage import label, binary_dilation, binary_closing

def fill_land(arr):
    output = np.copy(arr)

    # Fill NaN-s
    mask = np.isnan(arr)
    labels, count = label(mask)
    for idx in range(1, count + 1):
        hole = labels == idx
        surrounding_values = arr[binary_dilation(hole) & ~hole]
        output[hole] = any(surrounding_values)

    # Fill lakes
    land = output.astype(bool)
    lakes = binary_closing(land) & ~land
    labels, count = label(lakes)
    for idx in range(1, count + 1):
        lake = labels == idx
        output[lake] = lake.sum() < 6

    return output
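
Note that `fill_land` writes 1 (or 0) into the filled cells rather than a specific land value. If the NaN cells should instead receive the most frequent neighboring land value, as the question's supplement describes, the first loop could be adapted roughly as follows. This is a sketch under that reading of the question; `fill_nan_with_land_majority` is a hypothetical name, and the lake-filling step is left out.

```python
import numpy as np
from collections import Counter
from scipy.ndimage import label, binary_dilation

def fill_nan_with_land_majority(arr):
    """Fill each NaN hole with the most frequent neighboring land value (> 0)."""
    output = np.copy(arr)
    labels, count = label(np.isnan(arr))
    for idx in range(1, count + 1):
        hole = labels == idx
        surrounding = arr[binary_dilation(hole) & ~hole]
        land_values = surrounding[surrounding > 0]  # sea (0) gets no vote
        if land_values.size:
            output[hole] = Counter(land_values.tolist()).most_common(1)[0][0]
        else:
            output[hole] = 0  # only sea around the hole: treat it as sea
    return output

n = np.nan
coast = np.array([[0, 0, 0, 0],
                  [0, n, 3, 3],
                  [0, 0, 3, 3],
                  [0, 2, 2, 3]], dtype=float)

filled = fill_nan_with_land_majority(coast)
print(filled[1, 1])  # 3.0: three direct neighbors are sea, the single land neighbor is 3
```

Because sea cells are excluded from the vote, a hole on the coast becomes land even when sea is in the majority around it.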
Martin Valgur answered Sep 23 '22 08:09


I didn't find a library for this, so I wrote a function. If the None values are isolated cells in the array, you can use this:

import numpy as np
from collections import Counter


def getModulusSurround(data):
    # Return the most frequent value in data, ignoring None entries.
    tempdata = [x for x in data if x is not None]
    c = Counter(tempdata)
    return c.most_common(1)[0][0]


def main():

    array = [[1, 2, 2, 4, 5],
             [2, 3, 4, 5, 6],
             [3, 4, None, 6, 7],
             [1, 4, 2, 3, 4],
             [4, 6, 2, 2, 4]]

    # The None entry makes this an object array.
    array = np.array(array, dtype=object)

    rows, cols = array.shape
    for i in range(rows):
        for j in range(cols):
            if array[i, j] is None:
                # 3x3 window around (i, j); max() stops the slice start
                # from wrapping to a negative index at the border.
                temparray = array[max(i-1, 0):i+2, max(j-1, 0):j+2]
                array[i, j] = getModulusSurround(temparray.flatten())

    print(array)

main()
Po Stevanus Andrianta answered Sep 23 '22 08:09