Reasonable way to have different versions of None?

Working in Python3.

Say you have a million beetles, and your task is to catalogue the sizes of their spots. So you will make a table, where each row is a beetle and the numbers in the row represent the sizes of its spots:

 [[.3, 1.2, 0.5],
  [.6, .7],
  [1.4, .9, .5, .7],
  [.2, .3, .1, .7, .1]]

Also, you decide to store this in a numpy array, for which you pad the lists with None (numpy will convert this to np.nan).

 [[.3, 1.2, 0.5, None, None],
  [.6, .7, None, None, None],
  [1.4, .9, .5, .7, None],
  [.2, .3, .1, .7, .1]]
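
For illustration, a minimal sketch of that padding step (assuming plain Python lists and numpy), showing that the None entries become np.nan once the array is given a float dtype:

    import numpy as np

    rows = [[.3, 1.2, 0.5],
            [.6, .7],
            [1.4, .9, .5, .7],
            [.2, .3, .1, .7, .1]]

    # Pad every row with None up to the length of the longest row.
    width = max(len(r) for r in rows)
    padded = [r + [None] * (width - len(r)) for r in rows]

    # With dtype=float, numpy converts the None entries to np.nan.
    arr = np.array(padded, dtype=float)
    print(arr)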

But there is a problem: a value represented as None can be None for one of three reasons:

  1. The beetle doesn't have that many spots; the quantity does not exist.

  2. The beetle won't stay still and you can't measure the spot.

  3. You haven't got round to measuring that beetle yet, so the value is unassigned.

My problem doesn't actually involve beetles, but the principles are the same. I want 3 different None values so I can keep these missing value causes distinct. My current solution is to use a value so large that it is physically improbable, but this is not a very safe solution.
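
For concreteness, a minimal sketch of that sentinel-value workaround (the sentinel constants here are arbitrary, chosen only for illustration):

    import numpy as np

    # Arbitrary, "physically improbable" sentinel values, one per cause.
    NO_SPOT      = 1e30   # the quantity does not exist
    UNMEASURABLE = 2e30   # the beetle would not stay still
    UNASSIGNED   = 3e30   # not yet measured

    arr = np.array([[.3, 1.2, 0.5, NO_SPOT, NO_SPOT],
                    [.6, .7, UNMEASURABLE, UNASSIGNED, NO_SPOT]])

    # Anything that forgets to exclude the sentinels silently gives nonsense,
    # which is why this does not feel safe:
    print(arr.mean())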

Assume you cannot use negative numbers as sentinels; in reality the quantity I am measuring could be negative.

The data is big and read speed is important.

Edit: comments rightly point out that saying speed is important without saying which operations will be performed is a bit meaningless. Principal component analysis will probably be used for variable decorrelation, squared Euclidean distance calculations for a clustering algorithm (though the data is sparse in that respect), and possibly some interpolation. Eventually a recursive neural network, but that will come from a library, so I will just have to put the data into an input form. So probably nothing worse than linear algebra, and it should all fit in RAM if I am careful, I think.

What is a good strategy?

Clumsy cat asked Mar 14 '19



1 Answer

The simplest way to go would be with strings: 'not counted', 'unknown', and 'N/A'. However, if you want to process quickly in numpy, arrays of mixed numbers and objects are not your friend.

My suggestion would be to add several arrays of the same shape as your data, consisting of 0s and 1s: an array missing that is 1 where a spot is missing and 0 elsewhere, another array not_measured built the same way, and so on.

Then you can use NaNs everywhere, and later mask your data with, say, np.where(missing == 1) to easily find the specific NaNs you need.
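
A rough sketch of what that could look like in practice (the array names here are just illustrative):

    import numpy as np

    # NaN everywhere a value is absent, whatever the reason.
    data = np.array([[ .3, 1.2,  .5, np.nan, np.nan],
                     [ .6,  .7, np.nan, np.nan, np.nan],
                     [1.4,  .9,  .5,  .7, np.nan],
                     [ .2,  .3,  .1,  .7,  .1]])

    # One indicator array per missing-value cause, same shape as the data.
    no_spot      = np.array([[0, 0, 0, 1, 1],
                             [0, 0, 0, 1, 1],
                             [0, 0, 0, 0, 1],
                             [0, 0, 0, 0, 0]])
    not_measured = np.array([[0, 0, 0, 0, 0],
                             [0, 0, 1, 0, 0],
                             [0, 0, 0, 0, 0],
                             [0, 0, 0, 0, 0]])

    # Find the positions of one specific kind of missing value...
    print(np.where(no_spot == 1))

    # ...while numeric work can ignore all NaNs uniformly.
    print(np.nanmean(data, axis=1))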

Josh Friedlander answered Sep 21 '22