Pandas read_excel interference between na_values and converters bug?

Question

I have found some strange behavior when using the Pandas read_excel function. I have a contrived example of a spreadsheet that looks like:

Name    age weight
Bob     35   70
Alice   15   ...
Sam     ...  65
Susan   92   63
Harold  42   ...

Missing values are indicated by '...' and I would like them read as NaN. I would also like to convert the weights to grams. My first attempt was:

df=pd.read_excel('test.xls',na_values='...',converters={'weight':lambda y:y*1000})

Unfortunately, this applies the converter function first, so that Susan and Harold's missing weights become not three dots but 3000 dots, which does not match the given "na_value" entry and are thus not turned into NaN. This is highly unexpected.

Is there a reason this is the best behavior for this function?

Did I do something obviously wrong in my implementation of it?

What's the best way to get the behavior I want?

andrew_reece · Accepted Answer

The converters are run before na_values are applied, which is why you're seeing this behavior.
Just add a condition in your converter to catch the ... instances in weights - like adding a hand-rolled na_values:

df = pd.read_excel("test.xls", 
                   na_values="...", 
                   converters={"weight":lambda w: w*1000 if isinstance(w, int) else None})

print(df)

     Name   age   weight
0     Bob  35.0  70000.0
1   Alice  15.0      NaN
2     Sam   NaN  65000.0
3   Susan  92.0  63000.0
4  Harold  42.0      NaN

Keep the actual na_values argument to ensure ... values in the age field are also converted.

Pandas read_excel interference between na_values and converters bug?

Tags:

python

pandas

excel

bigred_bluejay

1 Answers

andrew_reece

Recent Activity

Donate For Us

Pandas read_excel interference between na_values and converters bug?

Tags:

python

pandas

excel

bigred_bluejay

1 Answers

andrew_reece

Related questions

Recent Activity

Donate For Us