
Splitting a dataframe based on condition

Tags:

python

pandas

I am trying to split my dataframe into two based on medical_plan_id: if it is empty, into df1; if it is not empty, into df2.

df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]
df2 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] is not ""]

The code below works, but if there are no empty fields, my code raises TypeError("invalid type comparison").

df1 = df_with_medicalplanid[df_with_medicalplanid['medical_plan_id'] == ""]

How do I handle such a situation?

My df_with_medicalplanid looks like this:

wellthie_issuer_identifier       ...       medical_plan_id
0                   UHC99806       ...                  None
1                   UHC99806       ...                  None
asked Dec 11 '22 by user1896796


1 Answer

Use ==, not is, to test equality

Likewise, use != instead of is not for inequality.

is has a special meaning in Python. It returns True if two variables point to the same object, while == checks if the objects referred to by the variables are equal. See also Is there a difference between == and is in Python?.
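A quick illustration (a minimal sketch; identity results like this can vary with CPython's string interning, so treat the is result as typical rather than guaranteed):

a = 'plan'
b = ''.join(['pl', 'an'])   # equal in value, but built as a separate object

a == b   # True: the values are equal
a is b   # typically False: they are not required to be the same object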

Don't repeat mask calculations

The Boolean masks you are creating are the most expensive part of your logic. It's also logic you want to avoid repeating manually, since your first and second masks are inverses of each other. You can therefore use the bitwise inverse ~ ("tilde"), also accessible via operator.invert, to negate an existing mask.
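For example (a small sketch showing that the two spellings agree on an illustrative mask):

import operator
import pandas as pd

mask = pd.Series([True, False, True])

(~mask).tolist()                 # [False, True, False]
operator.invert(mask).tolist()   # [False, True, False], same as ~mask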

Empty strings are different from null values

Equality with empty strings can be tested via == '', but testing for null values requires a specialized method: pd.Series.isnull. This is because null values in the NumPy arrays that Pandas uses are represented by np.nan, and np.nan != np.nan by design.
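To see why == cannot be used to find NaN values (a short sketch with an illustrative Series s):

import numpy as np
import pandas as pd

np.nan == np.nan          # False, by design

s = pd.Series([np.nan, '', 'UHC99806'])
(s == np.nan).tolist()    # [False, False, False] -- never matches
s.isnull().tolist()       # [True, False, False]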

If you want to replace empty strings with null values, you can do so:

df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)

Conceptually, it makes sense for missing values to be null (np.nan) rather than empty strings. But the opposite of the above process, i.e. converting null values to empty strings, is also possible:

df['medical_plan_id'] = df['medical_plan_id'].fillna('')

If the difference matters, you need to know your data and apply the appropriate logic.
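If you are not sure which case you have, a quick count of each can help (a small sketch, assuming the same medical_plan_id column as above):

(df['medical_plan_id'] == '').sum()    # how many empty strings
df['medical_plan_id'].isnull().sum()   # how many null values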

Semi-final solution

Assuming you do indeed have null values, calculate a single Boolean mask and its inverse:

mask = df['medical_plan_id'].isnull()

df1 = df[mask]
df2 = df[~mask]

Final solution: avoid extra variables

As a programmer, you should look to avoid creating unnecessary variables. In this case, there's no need to create two new variables; you can use GroupBy with dict to build a dictionary of dataframes whose False (== 0) and True (== 1) keys correspond to your masks:

dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))

Then dfs[0] represents df2 and dfs[1] represents df1 (see also this related answer). As a variant of the above, you can forgo the dictionary construction and use Pandas GroupBy methods directly:

dfs = df.groupby(df['medical_plan_id'].isnull())

dfs.get_group(0)  # equivalent to dfs[0] from dict solution
dfs.get_group(1)  # equivalent to dfs[1] from dict solution

Example

Putting all the above in action:

import numpy as np
import pandas as pd

df = pd.DataFrame({'medical_plan_id': [np.nan, '', 2134, 4325, 6543, '', np.nan],
                   'values': [1, 2, 3, 4, 5, 6, 7]})

df['medical_plan_id'] = df['medical_plan_id'].replace('', np.nan)
dfs = dict(tuple(df.groupby(df['medical_plan_id'].isnull())))

print(dfs[0], dfs[1], sep='\n'*2)

   medical_plan_id  values
2           2134.0       3
3           4325.0       4
4           6543.0       5

   medical_plan_id  values
0              NaN       1
1              NaN       2
5              NaN       6
6              NaN       7
answered Dec 22 '22 by jpp