I am trying to split my dataframe into two based on medical_plan_id: if it is empty, into df1; if it is not empty, into df2.
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]
The code below works, but if there are no empty fields, it raises TypeError("invalid type comparison"):
df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
How do I handle this situation?
My df_with_medicalplanid looks like this:
wellthie_issuer_identifier ... medical_plan_id
0 UHC99806 ... None
1 UHC99806 ... None
Use ==, not is, to test equality. Likewise, use != instead of is not for inequality.
is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
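As a quick illustration with plain Python objects (made-up values, only to show the distinction):
a = [1, 2, 3]
b = [1, 2, 3]
print(a == b)  # True: the two lists have equal contents
print(a is b)  # False: they are two distinct objects in memory
print(a is a)  # True: a variable is always the same object as itself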
The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually, as your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
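As a minimal sketch (the column values here are made up), both spellings produce the same negated mask:
import operator
import pandas as pd

s = pd.Series(['A123', None, 'B456', None])  # hypothetical medical_plan_id values
mask = s.isnull()
print(mask.tolist())                         # [False, True, False, True]
print((~mask).tolist())                      # [True, False, True, False]
print(operator.invert(mask).equals(~mask))   # True: operator.invert performs the same negation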
Equality versus empty strings can be tested via == '', but equality versus null values requires a specialized method: pd.Series.isnull. This is because null values are represented in the NumPy arrays used by Pandas as np.nan, and np.nan != np.nan by design.
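For example (again with made-up values), an == '' test and isnull catch different rows:
import numpy as np
import pandas as pd

print(np.nan == np.nan)            # False: NaN never compares equal, even to itself
print(np.nan != np.nan)            # True

s = pd.Series([np.nan, '', 'X1'])  # hypothetical column values
print((s == '').tolist())          # [False, True, False]: only the empty string
print(s.isnull().tolist())         # [True, False, False]: only the NaN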
If you want to replace empty strings with null values, you can do so:
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:
df['medical_plan_id'] = df['medical_plan_id'].fillna('')
If the difference matters, you need to know your data and apply the appropriate logic.
Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:
mask = df['medical_plan_id'].isnull()
df1 = df[mask]
df2 = df[~mask]
As a programmer, you should look to avoid creating additional variables where possible. In this case, there's no need to create two new variables; you can use GroupBy with dict to give a dictionary of dataframes, with False (== 0) and True (== 1) keys corresponding to your masks:
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo dictionary construction and use Pandas GroupBy methods:
dfs = df.groupby(df['medical_plan_id'].isnull())
dfs.get_group(0) # equivalent to dfs[0] from dict solution
dfs.get_group(1) # equivalent to dfs[1] from dict solution
Putting all of the above into action:
df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})
df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))
print(dfs[0], dfs[1], sep='\n'*2)
   medical_plan_id  values
2           2134.0       3
3           4325.0       4
4           6543.0       5

   medical_plan_id  values
0              NaN       1
1              NaN       2
5              NaN       6
6              NaN       7