Working in Python 3.
Say you have a million beetles, and your task is to catalogue the size of their spots. So you make a table where each row is a beetle and the numbers in the row represent the sizes of its spots:
[[.3, 1.2, 0.5],
[.6, .7],
[1.4, .9, .5, .7],
[.2, .3, .1, .7, .1]]
Also, you decide to store this in a numpy array, so you pad the lists with None (numpy will convert this to np.nan):
[[.3, 1.2, 0.5, None, None],
[.6, .7, None, None, None],
[1.4, .9, .5, .7, None],
[.2, .3, .1, .7, .1]]
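For concreteness, this is roughly how the padded array gets built (a minimal sketch using the toy numbers above):

import numpy as np

# Ragged rows padded to equal length with None.
rows = [[.3, 1.2, 0.5, None, None],
        [.6, .7, None, None, None],
        [1.4, .9, .5, .7, None],
        [.2, .3, .1, .7, .1]]

# Casting to float turns every None into np.nan.
spots = np.array(rows, dtype=float)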
But there is a problem: a value represented as None can be None for one of three reasons:
The beetle doesn't have that many spots; the quantity does not exist.
The beetle won't stay still and you can't measure the spot.
You haven't got round to measuring that beetle yet, so the value is unassigned.
My problem doesn't actually involve beetles, but the principles are the same. I want three different None-like values so I can keep these causes of missing values distinct. My current solution is to use a value so large that it is physically improbable, but this is not a very safe solution.
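Roughly what that workaround looks like (the sentinel constants here are invented for illustration, not my real values):

import numpy as np

# Improbably large positive sentinels, one per cause of missingness.
NO_SUCH_SPOT   = 1e30   # the quantity does not exist
NOT_MEASURABLE = 2e30   # the beetle would not stay still
UNASSIGNED     = 3e30   # not measured yet

spots = np.array([[.3, 1.2, 0.5, NO_SUCH_SPOT, NO_SUCH_SPOT],
                  [.6, .7, NOT_MEASURABLE, NO_SUCH_SPOT, NO_SUCH_SPOT],
                  [1.4, .9, .5, .7, UNASSIGNED],
                  [.2, .3, .1, .7, .1]])

# Any computation that forgets to filter the sentinels silently produces garbage,
# which is why this does not feel safe.
print(spots.mean())   # dominated by the sentinel values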
Assume you cannot use negative numbers - in reality the quantity I am measuring could be negative.
The data is big and read speed is important.
Edit: comments rightly point out that saying speed is important without saying which operations is a bit meaningless. Principal component analysis will probably be used for variable decorrelation, Euclidean distance squared calculations for a clustering algorithm (but the data is sparse in that variable), and possibly some interpolation. Eventually a recursive neural network, but that will come from a library, so I will just have to put the data into an input form. So maybe nothing worse than linear algebra, and it should all fit in RAM if I am careful, I think.
What is a good strategy?
The simplest way to go would be with strings: 'not counted', 'unknown' and 'N/A'. However, if you want fast processing in numpy, arrays with mixed numbers and objects are not your friend.
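To illustrate why (a small sketch with made-up values): mixing floats and strings forces dtype=object, and the array loses vectorised numeric operations.

import numpy as np

row = np.array([.6, .7, 'not counted', 'unknown', 'N/A'], dtype=object)
print(row.dtype)   # object
# row.sum() or row.mean() would raise a TypeError: you cannot do arithmetic
# on a mixed float/string object array, and even valid operations fall back
# to slow per-element Python calls.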
My suggestion would be to add several arrays of the same shape as your data, consisting of 0s and 1s: an array missing that is 1 where a spot is missing and 0 otherwise, likewise an array not_measured, and so on.
Then you can use NaNs everywhere in the data itself, and later mask it with, say, np.where(missing == 1) to easily find the specific NaNs you need.
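A minimal sketch of that layout, using the example data from the question (the masks' contents are of course invented for illustration):

import numpy as np

data = np.array([[.3, 1.2, 0.5, np.nan, np.nan],
                 [.6, .7, np.nan, np.nan, np.nan],
                 [1.4, .9, .5, .7, np.nan],
                 [.2, .3, .1, .7, .1]])

# One 0/1 array per cause, same shape as the data.
missing      = np.array([[0, 0, 0, 1, 1],
                         [0, 0, 0, 1, 1],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0]])
not_measured = np.array([[0, 0, 0, 0, 0],
                         [0, 0, 1, 0, 0],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0]])
unassigned   = np.array([[0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1],
                         [0, 0, 0, 0, 0]])

# The data array stays purely numeric, so NaN-aware routines keep working...
print(np.nanmean(data, axis=1))

# ...and each cause of missingness can be recovered when needed.
print(np.where(missing == 1))        # indices of the "does not exist" NaNs
print(np.where(not_measured == 1))   # indices of the "could not measure" NaNs

The masks can be stored as bool or uint8 to keep them small, and since they live outside the numeric array they never interfere with the PCA, distance calculations or whatever else you run on the data itself.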