I use Python and have an array with values 1.0 , 2.0 , 3.0 , 4.0 , 5.0 , 6.0 and np.nan as NoData.
I want to fill all "nan" with a value. This value should be the majority of the surrounding values.
For example:
1 1 1 1 1
1 n 1 2 2
1 3 3 2 1
1 3 2 3 1
"n" shall present "nan" in this example. The majority of its neighbors have the value 1. Thus, "nan" shall get replaced by value 1.
Note, that the holes consisting of "nan" can be of the size 1 to 5. For example (maximum size of 5 nan):
1 1 1 1 1
1 n n n 2
1 n n 2 1
1 3 2 3 1
Here the hole of "nan" have the following surrounding values:
surrounding_values = [1,1,1,1,1,2,1,2,3,2,3,1,1,1] -> Majority = 1
I tried the following code:
from sklearn.preprocessing import Imputer
array = np.array(.......) #consisting of 1.0-6.0 & np.nan
imp = Imputer(strategy="most_frequent")
fill = imp.fit_transform(array)
This works pretty good. However, it only uses one axis (0 = column, 1 = row). The default is 0 (column), so it uses the majority of the surrounding values of the same column. For example:
Array
2 1 2 1 1
2 n 2 2 2
2 1 2 2 1
1 3 2 3 1
Filled Array
2 1 2 1 1
2 1 2 2 2
2 1 2 2 1
1 3 2 3 1
So here you see, although the majority is 2, the majority of the surrounding column-values is 1 and thus it becomes 1 instead of 2.
As a result, I need to find another method using python. Any suggestions or ideas?
SUPPLEMENT:
Here you see the result, after I added the very helpfull improvement of Martin Valgur.
Think of "0" as sea (blue) and of the other values (> 0) as land (red).
If there is a "little" sea surrounded by land (the sea can again have the size 1-5 px) it will get land, as you can successfully see in the result-image. If the surrounded sea is bigger than 5px or outside the land, the sea wont gain land (This is not visible in the image, because it is not the case).
If there is 1px "nan" with more majority of sea than land, it will still become land (In this example it has 50/50).
The following picture shows what I need. At the border between sea (value=0) and land (value>0), the "nan"-pixel needs to get the value of the majority of the land-values.
That sounds difficult and I hope that I could explain it vividly.
A possible solution using label()
and binary_dilation()
from scipy.ndimage
:
import numpy as np
from scipy.ndimage import label, binary_dilation
from collections import Counter
def impute(arr):
imputed_array = np.copy(arr)
mask = np.isnan(arr)
labels, count = label(mask)
for idx in range(1, count + 1):
hole = labels == idx
surrounding_values = arr[binary_dilation(hole) & ~hole]
most_frequent = Counter(surrounding_values).most_common(1)[0][0]
imputed_array[hole] = most_frequent
return imputed_array
EDIT: Regarding your loosely-related follow-up question, you can extend the above code to achieve what you are after:
import numpy as np
from scipy.ndimage import label, binary_dilation, binary_closing
def fill_land(arr):
output = np.copy(arr)
# Fill NaN-s
mask = np.isnan(arr)
labels, count = label(mask)
for idx in range(1, count + 1):
hole = labels == idx
surrounding_values = arr[binary_dilation(hole) & ~hole]
output[hole] = any(surrounding_values)
# Fill lakes
land = output.astype(bool)
lakes = binary_closing(land) & ~land
labels, count = label(lakes)
for idx in range(1, count + 1):
lake = labels == idx
output[lake] = lake.sum() < 6
return output
i dont found any lib, so i wrote a function, if case all None in the middle of the array you can use these
import numpy as np
from collections import Counter
def getModulusSurround(data):
tempdata = list(filter(lambda x: x, data))
c = Counter(tempdata)
if c.most_common(1)[0][0]:
return(c.most_common(1)[0][0])
def main():
array = [[1, 2, 2, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, None, 6, 7],
[1, 4, 2, 3, 4],
[4, 6, 2, 2, 4]]
array = np.array(array)
for i in range(5):
for j in range(5):
if array[i,j] == None:
temparray = array[i-1:i+2,j-1:j+2]
array[i,j] = getModulusSurround(temparray.flatten())
print(array)
main()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With