Suppose I have a DataFrame such as:
   col1 col2
0     1    A
1     2    B
2     6    A
3     5    C
4     9    C
5     3    A
6     5    B
And multiple lists such as:
list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]
I can update the value of col2 depending on whether the value of col1 is included in a list, for example:
for i in list_1:
    df.loc[df.col1 == i, 'col2'] = 'A'
for i in list_2:
    df.loc[df.col1 == i, 'col2'] = 'B'
for i in list_3:
    df.loc[df.col1 == i, 'col2'] = 'C'
However, this is very slow. With a DataFrame of 30,000 rows and lists of roughly 5,000-10,000 items each, it takes a long time to run, especially compared to other pandas operations. Is there a better (faster) way of doing this?
You can use isin with np.select here:
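import numpy as np
# default keeps the existing col2 where col1 is in none of the lists
df['col2'] = np.select([df['col1'].isin(list_1),
                        df['col1'].isin(list_2),
                        df['col1'].isin(list_3)],
                       ['A', 'B', 'C'], default=df['col2'])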
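A minimal, self-contained sketch of the isin plus np.select approach, run against the sample data from the question. The variable names conditions and choices are only illustrative. Passing default=df["col2"] is a choice made here so that rows matching none of the lists keep their current value; np.select otherwise falls back to 0, which newer NumPy versions may refuse to combine with string choices. Note that np.select takes the first condition that matches, so if a value appeared in several lists the earliest list would win.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"col1": [1, 2, 6, 5, 9, 3, 5],
                   "col2": ["A", "B", "A", "C", "C", "A", "B"]})
list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

# One boolean mask per list, computed over the whole column at once
conditions = [df["col1"].isin(list_1),
              df["col1"].isin(list_2),
              df["col1"].isin(list_3)]
choices = ["A", "B", "C"]

# Rows that match no condition keep their existing col2 value
df["col2"] = np.select(conditions, choices, default=df["col2"])
```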
With map:
d = dict(zip(map(tuple, [list_1, list_2, list_3]), ['A', 'B', 'C']))
# flatten to {value: label}; fillna keeps col2 where col1 is in no list
df['col2'] = df['col1'].map({val: v for k, v in d.items() for val in k}).fillna(df['col2'])
   col1 col2
0     1    A
1     2    A
2     6    C
3     5    C
4     9    C
5     3    B
6     5    C
You can first convert the lists to dicts and then map to col1.
d1 = {k:'A' for k in list_1}
d2 = {k:'B' for k in list_2}
d3 = {k:'C' for k in list_3}
# map returns NaN for values missing from a dict, so combine_first can fill them
df['col2'] = (
    df.col1.map(d1)
    .combine_first(df.col1.map(d2))
    .combine_first(df.col1.map(d3))
    .fillna(df.col2)  # keep the old value where no list matches
)
If there are no duplicates across the lists, you can make it even faster by merging them into a single dict:
d = {**{k: 'A' for k in list_1},
     **{k: 'B' for k in list_2},
     **{k: 'C' for k in list_3}}
# map does one dict lookup per row; fillna keeps col2 where col1 is in no list
df['col2'] = df.col1.map(d).fillna(df.col2)
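As a rough, self-contained sketch of the single-dict idea on the question's sample data: the name lookup is illustrative only. Using map followed by fillna leaves a row unchanged when its col1 value is in none of the lists. If a value did appear in more than one list, the later entry in the merged dict would win, which mirrors the order of the loops in the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"col1": [1, 2, 6, 5, 9, 3, 5],
                   "col2": ["A", "B", "A", "C", "C", "A", "B"]})
list_1 = [1, 2, 4]
list_2 = [3, 8]
list_3 = [5, 6, 7, 9]

# Merge the three lists into one {col1 value: new col2 label} lookup
lookup = {**{k: "A" for k in list_1},
          **{k: "B" for k in list_2},
          **{k: "C" for k in list_3}}

# One lookup per row; values absent from the dict become NaN
# and are then filled from the existing col2
df["col2"] = df["col1"].map(lookup).fillna(df["col2"])
```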
I would suggest iterating through your lists with a dictionary using conditional updating:
import numpy as np

# Map each new col2 value to the list of col1 values that should receive it
col_dict = {
    "A": [1, 2, 4],
    "B": [3, 8],
    "C": [5, 6, 7, 9],
}
# Iterate and update
for key, value in col_dict.items():
    # key is the new col2 value; value is the lookup list for col1
    df["col2"] = np.where(df["col1"].isin(value), key, df["col2"])
One concern is overwriting values, since a row can technically match multiple lists. In the loop above, keys are applied in dictionary order, so a later key overwrites an earlier one.
If rows never match multiple keys, consider keeping a running index of "unmatched" rows and updating it as you proceed, so that each iteration checks fewer rows than the one before.
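The running index of unmatched rows could be sketched as follows, assuming the DataFrame has a unique index. The names remaining, candidates and hits are hypothetical, and whether this beats a single map or np.select call has not been measured here. Because matched rows are dropped from later iterations, the first matching key wins if lists overlap.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"col1": [1, 2, 6, 5, 9, 3, 5],
                   "col2": ["A", "B", "A", "C", "C", "A", "B"]})
col_dict = {
    "A": [1, 2, 4],
    "B": [3, 8],
    "C": [5, 6, 7, 9],
}

remaining = df.index  # index labels of rows that have not matched yet
for label, values in col_dict.items():
    if remaining.empty:
        break  # every row is already assigned
    # Only look at rows that are still unmatched
    candidates = df.loc[remaining, "col1"]
    hits = candidates[candidates.isin(values)].index
    df.loc[hits, "col2"] = label
    # Shrink the working set for the next iteration
    remaining = remaining.difference(hits)
```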