I have found some strange behavior when using the Pandas read_excel function. I have a contrived example of a spreadsheet that looks like:
Name age weight
Bob 35 70
Alice 15 ...
Sam ... 65
Susan 92 63
Harold 42 ...
Missing values are indicated by '...' and I would like them read as NaN. I would also like to convert the weights to grams. My first attempt was:
df=pd.read_excel('test.xls',na_values='...',converters={'weight':lambda y:y*1000})
Unfortunately, this applies the converter function first, so that Susan and Harold's missing weights become not three dots but 3000 dots, which does not match the given "na_value" entry and are thus not turned into NaN. This is highly unexpected.
Is there a reason this is the best behavior for this function?
Did I do something obviously wrong in my implementation of it?
What's the best way to get the behavior I want?
The converters are run before na_values are applied, which is why you're seeing this behavior.
Just add a condition in your converter to catch the ... instances in weights -
like adding a hand-rolled na_values:
df = pd.read_excel("test.xls",
na_values="...",
converters={"weight":lambda w: w*1000 if isinstance(w, int) else None})
print(df)
Name age weight
0 Bob 35.0 70000.0
1 Alice 15.0 NaN
2 Sam NaN 65000.0
3 Susan 92.0 63000.0
4 Harold 42.0 NaN
Keep the actual na_values argument to ensure ... values in the age field are also converted.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With