Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read_excel interference between na_values and converters bug?

I have found some strange behavior when using the Pandas read_excel function. I have a contrived example of a spreadsheet that looks like:

Name    age weight
Bob     35   70
Alice   15   ...
Sam     ...  65
Susan   92   63
Harold  42   ...

Missing values are indicated by '...' and I would like them read as NaN. I would also like to convert the weights to grams. My first attempt was:

df=pd.read_excel('test.xls',na_values='...',converters={'weight':lambda y:y*1000})

Unfortunately, this applies the converter function first, so that Susan and Harold's missing weights become not three dots but 3000 dots, which does not match the given "na_value" entry and are thus not turned into NaN. This is highly unexpected.

Is there a reason this is the best behavior for this function?

Did I do something obviously wrong in my implementation of it?

What's the best way to get the behavior I want?

like image 884
bigred_bluejay Avatar asked Oct 30 '25 09:10

bigred_bluejay


1 Answers

The converters are run before na_values are applied, which is why you're seeing this behavior.
Just add a condition in your converter to catch the ... instances in weights - like adding a hand-rolled na_values:

df = pd.read_excel("test.xls", 
                   na_values="...", 
                   converters={"weight":lambda w: w*1000 if isinstance(w, int) else None})

print(df)

     Name   age   weight
0     Bob  35.0  70000.0
1   Alice  15.0      NaN
2     Sam   NaN  65000.0
3   Susan  92.0  63000.0
4  Harold  42.0      NaN

Keep the actual na_values argument to ensure ... values in the age field are also converted.

like image 121
andrew_reece Avatar answered Nov 02 '25 00:11

andrew_reece



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!