I'm an avid R user and am learning Python along the way. One piece of example code that I can easily run in R is perplexing me in Python.
Here's the original data (constructed within R):
library(tidyverse)
df <- tribble(~name, ~age, ~gender, ~height_in,
"john",20,'m',66,
'mary',NA,'f',62,
NA,38,'f',68,
'larry',NA,NA,NA
)
The output of this looks like this:
df
# A tibble: 4 x 4
name age gender height_in
<chr> <dbl> <chr> <dbl>
1 john 20 m 66
2 mary NA f 62
3 NA 38 f 68
4 larry NA NA NA
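I want to do 3 things:
1. Replace the missing values in the character columns with "zz".
2. Convert the character columns to factors.
3. Replace the missing values in the numeric columns with 0.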
Here's how I did it in R (again, using the tidyverse package):
tmp <- df %>%
mutate_if(is.character, function(x) ifelse(is.na(x),"zz",x)) %>%
mutate_if(is.character, as.factor) %>%
mutate_if(is.numeric, function(x) ifelse(is.na(x), 0, x))
Here's the output of the dataframe tmp:
tmp
# A tibble: 4 x 4
name age gender height_in
<fct> <dbl> <fct> <dbl>
1 john 20 m 66
2 mary 0 f 62
3 zz 38 f 68
4 larry 0 zz 0
I'm familiar with if and else statements in Python. What I don't know is the correct and most readable way to express the code above in Python. I'm guessing that there is no mutate_if equivalent in the pandas package. My question is: what pandas code mimics the mutate_if, is.character, is.numeric, and as.factor functions found in tidyverse and R?
On a side note, I'm less interested in the speed or efficiency of the code than in its readability, which is why I really enjoy tidyverse. I would be grateful for any tips or suggestions.
Edit 1: adding code to create a pandas dataframe
Here is the code I used to create the dataframe within Python. This may assist others in getting started.
import pandas as pd
import numpy as np
my_dict = {
'name' : ['john','mary', np.nan, 'larry'],
'age' : [20, np.nan, 38, np.nan],
'gender' : ['m','f','f', np.nan],
'height_in' : [66, 62, 68, np.nan]
}
df = pd.DataFrame(my_dict)
The output of this should be similar:
print(df)
name age gender height_in
0 john 20.0 m 66.0
1 mary NaN f 62.0
2 NaN 38.0 f 68.0
3 larry NaN NaN NaN
Well, after some sleep, I think I have it figured out.
Here's the code I used to take the pandas dataframe and apply the comparable mutate_if functions I mentioned earlier to get the same results.
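# fill in the missing values (similar to mutate_if from tidyverse)
# numeric columns: replace NaN with 0
df1 = df.select_dtypes(include=['number']).fillna(0)
# text columns: replace NaN with 'zz', then convert to category (like as.factor)
df2 = df.select_dtypes(include=['object']).fillna('zz').astype('category')
# recombine; both pieces share the original index, so the rows line up
df = pd.concat([df2, df1], axis=1)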
print(df)
name gender age height_in
0 john m 20.0 66.0
1 mary f 0.0 62.0
2 zz f 38.0 68.0
3 larry zz 0.0 0.0
# check again for the data types
df.dtypes
name category
gender category
age float64
height_in float64
dtype: object
The catch is that I had to 'break' apart the original dataframe, apply the changes (i.e., fill in the missing values and change the data types), and then recombine the columns (i.e., put the dataframe back together). As a side effect, the column order changes: the category columns now come first.
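For comparison, here is a minimal sketch of one way to avoid splitting and recombining: loop over the columns and branch on the dtype, which is roughly what mutate_if does. It assumes that every non-numeric column should be treated as text (filled with 'zz' and converted to category), and the name tmp is only borrowed from the R code above. It starts again from the original dataframe rather than from the recombined one.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Rebuild the original dataframe so the example is self-contained.
df = pd.DataFrame({
    'name': ['john', 'mary', np.nan, 'larry'],
    'age': [20, np.nan, 38, np.nan],
    'gender': ['m', 'f', 'f', np.nan],
    'height_in': [66, 62, 68, np.nan],
})

tmp = df.copy()
for col in tmp.columns:
    if is_numeric_dtype(tmp[col]):
        # like mutate_if(is.numeric, ...): replace missing values with 0
        tmp[col] = tmp[col].fillna(0)
    else:
        # assumption: anything non-numeric is text; fill with 'zz',
        # then convert to category (the pandas counterpart of as.factor)
        tmp[col] = tmp[col].fillna('zz').astype('category')

print(tmp)
print(tmp.dtypes)
```

Because each column is replaced in place, the original column order (name, age, gender, height_in) is kept, which the concat approach does not do.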