Working in Python 3.
Say you have a million beetles, and your task is to catalogue the size of their spots. So you make a table where each row is a beetle and the numbers in the row represent the sizes of its spots:
[[.3, 1.2, 0.5],
[.6, .7],
[1.4, .9, .5, .7],
[.2, .3, .1, .7, .1]]
Also, you decide to store this in a numpy array, so you pad the lists with None (numpy will convert this to np.nan):
[[.3, 1.2, 0.5, None, None],
[.6, .7, None, None, None],
[1.4, .9, .5, .7, None],
[.2, .3, .1, .7, .1]]
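For concreteness, this is roughly how the padded array gets built (a minimal sketch using the toy numbers above):

import numpy as np

# Ragged rows padded to equal length with None.
rows = [[.3, 1.2, 0.5, None, None],
        [.6, .7, None, None, None],
        [1.4, .9, .5, .7, None],
        [.2, .3, .1, .7, .1]]

# Casting to float turns every None into np.nan.
spots = np.array(rows, dtype=float)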
But there is a problem: a value represented as None can be None for one of three reasons:
The beetle doesn't have that many spots; the quantity does not exist.
The beetle won't stay still and you can't measure the spot.
You haven't got round to measuring that beetle yet, so the value is unassigned.
My problem doesn't actually involve beetles, but the principles are the same. I want three different None-like values so I can keep these causes of missing values distinct. My current solution is to use a value so large that it is physically improbable, but this is not a very safe solution.
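Roughly what that workaround looks like (the sentinel constants here are invented for illustration, not my real values):

import numpy as np

# Improbably large positive sentinels, one per cause of missingness.
NO_SUCH_SPOT   = 1e30   # the quantity does not exist
NOT_MEASURABLE = 2e30   # the beetle would not stay still
UNASSIGNED     = 3e30   # not measured yet

spots = np.array([[.3, 1.2, 0.5, NO_SUCH_SPOT, NO_SUCH_SPOT],
                  [.6, .7, NOT_MEASURABLE, NO_SUCH_SPOT, NO_SUCH_SPOT],
                  [1.4, .9, .5, .7, UNASSIGNED],
                  [.2, .3, .1, .7, .1]])

# Any computation that forgets to filter the sentinels silently produces garbage,
# which is why this does not feel safe.
print(spots.mean())   # dominated by the sentinel values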
Assume you cannot use negative numbers - in reality the quantity I am measuring could be negative.
The data is big and read speed is important.
Edit: comments rightly point out that saying speed is important without saying which operations is a bit meaningless. Principal component analysis will probably be used for variable decorrelation, Euclidean distance squared calculations for a clustering algorithm (but the data is sparse in that variable), and possibly some interpolation. Eventually a recursive neural network, but that will come from a library, so I will just have to put the data into an input form. So maybe nothing worse than linear algebra, and it should all fit in RAM if I am careful, I think.
What is a good strategy?
The simplest way to go would be with strings: 'not counted', 'unknown' and 'N/A'. However, if you want fast processing in numpy, arrays with mixed numbers and objects are not your friend.
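To illustrate why (a small sketch with made-up values): mixing floats and strings forces dtype=object, and the array loses vectorised numeric operations.

import numpy as np

row = np.array([.6, .7, 'not counted', 'unknown', 'N/A'], dtype=object)
print(row.dtype)   # object
# row.sum() or row.mean() would raise a TypeError: you cannot do arithmetic
# on a mixed float/string object array, and even valid operations fall back
# to slow per-element Python calls.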
My suggestion would be to add several arrays of the same shape as your data, consisting of 0s and 1s: an array missing that is 1 where a spot is missing and 0 otherwise, likewise an array not_measured, and so on.
Then you can use NaNs everywhere in the data itself, and later mask it with, say, np.where(missing == 1) to easily find the specific NaNs you need.
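A minimal sketch of that layout, using the example data from the question (the masks' contents are of course invented for illustration):

import numpy as np

data = np.array([[.3, 1.2, 0.5, np.nan, np.nan],
                 [.6, .7, np.nan, np.nan, np.nan],
                 [1.4, .9, .5, .7, np.nan],
                 [.2, .3, .1, .7, .1]])

# One 0/1 array per cause, same shape as the data.
missing      = np.array([[0, 0, 0, 1, 1],
                         [0, 0, 0, 1, 1],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0]])
not_measured = np.array([[0, 0, 0, 0, 0],
                         [0, 0, 1, 0, 0],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0]])
unassigned   = np.array([[0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1],
                         [0, 0, 0, 0, 0]])

# The data array stays purely numeric, so NaN-aware routines keep working...
print(np.nanmean(data, axis=1))

# ...and each cause of missingness can be recovered when needed.
print(np.where(missing == 1))        # indices of the "does not exist" NaNs
print(np.where(not_measured == 1))   # indices of the "could not measure" NaNs

The masks can be stored as bool or uint8 to keep them small, and since they live outside the numeric array they never interfere with the PCA, distance calculations or whatever else you run on the data itself.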