NumPy or Pandas: Keeping array type as integer while having a NaN value

People also ask

How does NumPy array deal with NaN values?

In pandas, the most common way is the .fillna() method, which replaces each NaN with a value you specify. Note that .fillna() is a pandas method; plain NumPy arrays do not have it.

Does NumPy support NaN?

Only in floating-point arrays, at least with the current version of NumPy. NaN is a special floating-point value, so an integer array cannot hold it.

What datatype is NaN in pandas?

NaN. As described in Working with missing data, pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point.
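The three points above can be checked with a minimal sketch. The array contents and variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

int_arr = np.array([1, 2, 3], dtype="int64")

# Assigning NaN into an integer array is rejected by NumPy.
try:
    int_arr[0] = np.nan
    assigned = True
except ValueError:
    assigned = False

# A float copy of the same data can hold NaN.
float_arr = int_arr.astype("float64")
float_arr[0] = np.nan

# pandas upcasts integer data to float64 as soon as a NaN is present.
s = pd.Series([1, 2, np.nan])

print(assigned, float_arr.dtype, s.dtype)
```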

NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:

http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na

(This feature has been added beginning with version 0.24 of pandas, but note it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lower case): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support )

This capability has been added to pandas (beginning with version 0.24): https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support

At this point, it requires the use of extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase).
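As a rough illustration of the difference between the two dtypes (requires pandas 0.24 or later; the values are arbitrary):

```python
import pandas as pd

# Capitalized "Int64" is the nullable extension dtype.
nullable = pd.Series([1, 2, None], dtype="Int64")

# Without an explicit dtype, the missing value forces float64.
default = pd.Series([1, 2, None])

print(nullable.dtype)           # Int64
print(default.dtype)            # float64
print(nullable.isna().tolist())
print((nullable + 1).dtype)     # arithmetic keeps the nullable integer dtype
```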

If performance is not the main issue, you can store strings instead.

df.col = df.col.dropna().apply(lambda x: str(int(x)))

Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated value to represent NaN.

You can also temporarily duplicate the columns: one as you have it, with floats; the other one experimental, with ints or strings. Then insert asserts in every reasonable place, checking that the two are in sync. After enough testing you can let go of the floats.
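One possible way to sketch the duplicated-column idea is shown here. The column name col_int, the sentinel -1 and the helper check_in_sync are hypothetical choices for illustration, not part of the original answer:

```python
import numpy as np
import pandas as pd

SENTINEL = -1  # hypothetical dedicated value standing in for NaN

df = pd.DataFrame({"col": [1.0, 2.0, np.nan]})
# Experimental integer copy kept next to the original float column.
df["col_int"] = df["col"].fillna(SENTINEL).astype("int64")


def check_in_sync(frame):
    """Assert that the float column and its integer copy agree."""
    missing = frame["col"].isna()
    # Missing floats must map to the sentinel ...
    assert (frame.loc[missing, "col_int"] == SENTINEL).all()
    # ... and every other value must be identical in both columns.
    assert (frame.loc[~missing, "col_int"] == frame.loc[~missing, "col"]).all()


check_in_sync(df)
print(df)
```

Calling check_in_sync after each transformation step is one way to place the asserts "in every reasonable place".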

This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:

a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)

This at least allows the proper 'native' column type to be used, so operations like subtraction and comparison work as expected.
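One caveat with a sentinel is that aggregations treat it as real data unless it is masked out first. A small sketch follows, with made-up coordinates and assuming 0 never occurs as a genuine value:

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name comes from the snippet above.
a3 = pd.DataFrame({"MapInfo": [100.0, np.nan, 300.0]})
a3["MapInfo"] = a3["MapInfo"].fillna(0).astype("int64")

# The sentinel 0 is counted as a real coordinate here.
naive_mean = a3["MapInfo"].mean()

# Masking the sentinel restores the intended result.
valid = a3["MapInfo"] != 0
masked_mean = a3.loc[valid, "MapInfo"].mean()

print(naive_mean, masked_mean)
```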

Pandas v0.24+

Functionality to support NaN in integer series is available from v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.

Pandas v0.23 and earlier

In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
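A small sketch of what staying with a float series gives you: NaN-aware reductions and arithmetic that keeps a numeric dtype, whereas the same data stored as object stays object after arithmetic. The values are arbitrary and no timings are claimed here:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])   # upcast to float64 because of the NaN
obj = s.astype(object)             # same values stored as Python objects

total = s.sum()                    # NaN is skipped by default
doubled = s * 2                    # stays float64
doubled_obj = obj * 2              # result remains object dtype

print(total, doubled.dtype, doubled_obj.dtype)
```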

The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan])

print(s.astype(object))

0      1
1      2
2      3
3    NaN
dtype: object

For cosmetic reasons, e.g. output to a file, this may be preferable.

Pandas v0.23 and earlier: background

NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:

In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

The docs also provide rules for upcasting due to NaN inclusion:

Typeclass   Promotion dtype for storing NAs
floating    no change
object      no change
integer     cast to float64
boolean     cast to object
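The promotion rules in the table can be reproduced with a short sketch. Here a missing value is introduced by reindexing to a label that does not exist; the dictionary and variable names are illustrative only:

```python
import pandas as pd

examples = {
    "floating": pd.Series([1.5, 2.5], dtype="float64"),
    "object": pd.Series(["a", "b"], dtype=object),
    "integer": pd.Series([1, 2], dtype="int64"),
    "boolean": pd.Series([True, False], dtype="bool"),
}

promotions = {}
for typeclass, series in examples.items():
    # Label 2 is absent from the original index, so it is filled with NaN.
    with_na = series.reindex([0, 1, 2])
    promotions[typeclass] = (str(series.dtype), str(with_na.dtype))
    print(typeclass, promotions[typeclass])
```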

This has been possible since pandas v0.24.0.

Quote from the pandas 0.24.x release notes: "Pandas has gained the ability to hold integer dtypes with missing values."

Just wanted to add that if you are trying to convert a float vector (e.g. 1.143) that contains NA to integer (1), converting directly to the new 'Int64' dtype will give you an error. To solve this, round the numbers first and then call .astype('Int64').

import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])
# Without round(), the next line raises an error:
# s1.astype('Int64')
# TypeError: cannot safely cast non-equivalent float64 to int64
# With round() it works:
s1.round().astype('Int64')
0      1
1      2
2    NaN
dtype: Int64

My use case is that I have a float series that I want to round to int, but after .round() a trailing '.0' remains on each number; converting to an integer dtype drops it.
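Which rounding you want is worth deciding explicitly before the cast. As a sketch with made-up values: Series.round() follows NumPy's round-half-to-even behaviour, while np.floor always rounds down; both leave whole-number floats that convert cleanly to 'Int64':

```python
import numpy as np
import pandas as pd

s = pd.Series([0.5, 1.5, 2.5, np.nan])

# round() sends exact halves to the nearest even integer.
rounded = s.round().astype("Int64")

# np.floor() always rounds toward negative infinity.
floored = np.floor(s).astype("Int64")

print(rounded.dropna().tolist())
print(floored.dropna().tolist())
```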


Tags: python, type-conversion, int, pandas, numpy
