
How to replace all non-numeric entries with NaN in a pandas dataframe?

Tags:

python

pandas

I have various csv files and I import them as a DataFrame. The problem is that many files use different symbols for missing values. Some use 'nan', others 'NaN', 'ND', 'None', 'missing', etc., or just leave the entry empty. Is there a way to replace all these values with np.nan? In other words, any non-numeric value in the dataframe should become np.nan. Thank you for the help.
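For example, a couple of columns might come in looking like this (made-up data, just to illustrate the mix of markers):

import pandas as pd

# Made-up example of the mixed missing-value markers I mean
df = pd.DataFrame({'height': ['1.52', 'nan', 'ND', '1.70'],
                   'weight': ['55', 'missing', 'None', '']})
# Desired result: every non-numeric entry becomes np.nan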

asked Jan 30 '17 by user6566438


2 Answers

Replacing non-numeric entries on read, the easier (and safer) way

TL;DR: Set a datatype for the column(s) that aren't casting properly, and supply a list of na_values

import numpy as np
import pandas as pd

# Create a custom list of values I want to cast to NaN, and explicitly
#   define the data types of columns:
na_values = ['None', '(S)', 'S']
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctapi': np.float64}, na_values=na_values)

Longer Explanation

I believe best practice when working with messy data is to:

  • Provide datatypes to pandas for columns whose datatypes are not inferred properly.
  • Explicitly define a list of values that should be cast to NaN.

This is quite easy to do.

Pandas' read_csv has a list of values that it looks for and automatically casts to NaN when parsing the data (see the documentation of read_csv for the list). You can extend this list using the na_values parameter, and you can tell pandas how to cast particular columns using the dtype parameter.
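For instance, a minimal sketch with a hypothetical file and column name:

import numpy as np
import pandas as pd

# Values in na_values are added on top of pandas' built-in NaN markers;
#   pass keep_default_na=False if you ever want to replace the defaults instead.
df = pd.read_csv('measurements.csv',               # hypothetical file
                 dtype={'height': np.float64},     # force this column to float64
                 na_values=['ND', 'missing', 'None'])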

In the example above, pctapi is the name of a column that was being cast to object instead of float64 because of the NaN markers in it. So I force pandas to cast it to float64 and provide the read_csv function with a list of values to cast to NaN.

Process I follow

Since data science is so often about process, I thought I'd describe the steps I use to build an na_values list and debug this issue with a dataset.

Step 1: Try to import the data and let pandas infer data types. Check if the data types are as expected. If they are, move on.

[screenshot: the dtypes pandas inferred for each column]

In the example above, pandas got about half the columns right. However, I expected all columns listed below the 'count' field to be of type float64. We'll need to fix this.
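That first check is just a read followed by a look at the dtypes, something like:

import pandas as pd

# Let pandas infer everything, then inspect what it came up with
last_names = pd.read_csv('names_2010_census.csv')
print(last_names.dtypes)   # the columns below 'count' should be float64 but come back as object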

Step 2: If data types are not as expected, explicitly set the data types on read using the dtype parameter. By default, this will throw errors on values that cannot be cast.

# note: the dtype dictionary specifies column types. pandas will attempt to infer
#   the type of any column that's not listed
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64})

Here's what happens when I run the code above: pandas raises an error pointing at the value it couldn't cast.

Step 3: Create an explicit list of values pandas cannot convert and cast them to NaN on read.

From the error message, I can see that pandas was unable to cast the value '(S)', so I add it to my list of na_values:

# note the new na_values argument provided to read_csv
last_names = pd.read_csv('names_2010_census.csv', dtype={'pctwhite': np.float64}, na_values=['(S)'])

Finally, I repeat steps 2 & 3 until I have a comprehensive list of dtype mappings and na_values.
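The end result of that loop might look something like this (the exact set of columns is hypothetical; only pctwhite and pctapi appear in the snippets above):

import numpy as np
import pandas as pd

# Accumulated after a few passes through steps 2 & 3
na_values = ['None', '(S)', 'S']
dtype = {'pctwhite': np.float64, 'pctapi': np.float64}   # extend with any other misbehaving columns

last_names = pd.read_csv('names_2010_census.csv', dtype=dtype, na_values=na_values)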

If you're working on a hobbyist project, this method may be more than you need and you may want to use instant's answer below instead. However, if you're working on production systems or on a team, it's well worth the 10 minutes it takes to correctly cast your columns.

answered Sep 22 '22 by Matt


I found what I think is a relatively elegant but also robust method:

def isnumber(x):
    # True if x can be converted to a float, False otherwise
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

# Keep only the entries that look numeric; everything else becomes NaN
df[df.applymap(isnumber)]

In case it's not clear: You define a function that returns True only if whatever input you have can be converted to a float. You then filter df with that boolean dataframe, which automatically assigns NaN to the cells you didn't filter for.
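Here's a small sketch of what that looks like on a toy dataframe (the data is invented for illustration):

import pandas as pd

def isnumber(x):
    try:
        float(x)
        return True
    except (ValueError, TypeError):
        return False

df = pd.DataFrame({'a': ['1.5', 'ND', '2'],
                   'b': ['missing', '3.1', 'None']})

print(df[df.applymap(isnumber)])
#      a    b
# 0  1.5  NaN
# 1  NaN  3.1
# 2    2  NaN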

Another solution I tried was to define isnumber as

import numbers

def isnumber(x):
    return isinstance(x, numbers.Number)

but what I liked less about that approach is that a number stored as a string would mistakenly get filtered out. This is also a sneaky error, because the dataframe displays the string "99" exactly the same as the number 99.
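A quick illustration of that pitfall (toy data):

import numbers
import pandas as pd

df = pd.DataFrame({'a': [99, '99']})
print(df.applymap(lambda x: isinstance(x, numbers.Number)))
#        a
# 0   True
# 1  False   <- the string '99' gets dropped even though it's a perfectly good number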

EDIT:

In your case you probably still need to call df = df.applymap(float) after filtering, because float accepts all the different capitalizations of 'nan', so those entries pass the filter but remain strings in the dataframe until you explicitly convert them.
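Continuing the toy example from above (reusing the isnumber function defined there), this is the difference that final conversion makes:

# 'nan' passes the isnumber filter because float('nan') works, so it stays a string
df = pd.DataFrame({'a': ['1.5', 'nan', 'ND']})
filtered = df[df.applymap(isnumber)]
print(filtered['a'].tolist())                    # ['1.5', 'nan', nan] -- the first two are still strings

print(filtered.applymap(float)['a'].tolist())    # [1.5, nan, nan] -- proper floats now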

answered Sep 23 '22 by instant