I have the following dataframe:
d = {'col1': [1, "Avoid", 3, "Avoid"], 'col2': ["AA", "BB", "Avoid", "Avoid"]}
df = pd.DataFrame(data=d)
df
col1 col2
0 1 AA
1 Avoid BB
2 3 Avoid
3 Avoid Avoid
I have to conditionally concat col1 and col2 into col3. Conditions:
The end result is supposed to look as the following:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
How can I do this without dealing with for loops?
Run a list comprehension on the strings in plain python:
out = [f"{l}&{r}"
if 'Avoid' not in {l, r}
else 'Avoid'
for l, r in zip(df.col1, df.col2)]
df.assign(col3 = out)
col1 col2 col3
0 1 AA 1&AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
Is not an efficient way to work with pandas, but if you can't change the data structure these are solutions:
Solution 1:
def custom_merge(cols):
if cols["col1"]=="Avoid" or cols["col2"]=="Avoid":
return "Avoid"
else:
return f"{cols['col1']} & {cols['col2']}"
df['col3'] = df.apply(custom_merge, axis=1)
Solution 2:
df['col3'] = (df["col1"].astype(str) + " & " + df["col2"].astype(str)).apply(lambda x: "Avoid" if 'Avoid' in x else x)
Both solutions results in the following:
col1 col2 col3
0 1 AA 1 & AA
1 Avoid BB Avoid
2 3 Avoid Avoid
3 Avoid Avoid Avoid
Execution Time comparison
In this section I will count the execution times of the proposed solutions. @mozway proposed 2 very tricky solutions in his answer, which I will call Solution 3a and Solution 3b. Another interesting solution is @sammywemmy's solution that uses a list comprehension and then adds the list to the dataframe, I will call this solution solution 4
The instances of the experiments will have the following structure:
import pandas as pd
n = 100000
d = {'col1': [1, "Avoid", 3, "Avoid"] * n, 'col2': ["AA", "BB", "Avoid", "Avoid"] * n}
df = pd.DataFrame(data=d)
Execution time:
Solution 1:
%timeit 3.56 s ± 71.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%timeit 140 ms ± 1.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 3a:
%timeit 3.44 s ± 77.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 3b:
%timeit 893 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 4:
%timeit 191 ms ± 5.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Solution 1 and solution 3a have similar execution times on the instance under consideration. Solution 3b runs 4 times faster than solutions 1 and 3a. The fastest solution is solution 2 which goes about 7 times faster than solution 2 and about 25 times faster than solutions 1 and 3a. This is because solution 2 takes advantage of pandas vectorization. Solution 4 has a similar runtime as solution 2, taking advantage of list comprehension for the merge operation (without using pandas).
TIPS: If you can change the data format, the advice is to format the data so that you can use the native pandas functions to do the join operations or if you can't change the data format and can do without pandas you might have a slight speed up over solution 2 using dictionaries or lists, doing the merge operation using list comphrehnsion.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With